From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by sourceware.org (Postfix) with ESMTPS id D6BFE3858430 for ; Fri, 29 Oct 2021 19:13:10 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org D6BFE3858430 Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-582-vhE7DfpvOKGwfHe84i1C4g-1; Fri, 29 Oct 2021 15:13:09 -0400 X-MC-Unique: vhE7DfpvOKGwfHe84i1C4g-1 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 3DB4018125C0; Fri, 29 Oct 2021 19:13:08 +0000 (UTC) Received: from oldenburg.str.redhat.com (unknown [10.39.194.250]) by smtp.corp.redhat.com (Postfix) with ESMTPS id DBBBF5D9CA; Fri, 29 Oct 2021 19:13:06 +0000 (UTC) From: Florian Weimer To: gcc@gcc.gnu.org Cc: Jason Merrill , Jakub Jelinek , libc-alpha@sourceware.org Subject: GCC DWARF unwinder _Unwind_Find_FDE acceleration Date: Fri, 29 Oct 2021 21:13:04 +0200 Message-ID: <87a6irir0f.fsf@oldenburg.str.redhat.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux) MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain X-Spam-Status: No, score=-4.0 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=unavailable autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 29 Oct 2021 19:13:12 -0000 I've got a mostly-complete implementation of a faster _Unwind_Find_FDE implementation which lock-free and async-signal-safe. The core idea is to turn into effectively this. const fde * _Unwind_Find_FDE (void *pc, struct dwarf_eh_bases *bases) { struct unw_eh_callback_data data; const fde *ret; ret = _Unwind_Find_registered_FDE (pc, bases); if (ret != NULL) return ret; void *eh_frame = _dl_find_eh_frame (pc); if (eh_frame == NULL) return NULL; struct find_fde_tail_result result = find_fde_tail ((_Unwind_Ptr) pc, eh_frame, NULL); if (result.entry != NULL) { bases->tbase = NULL; bases->dbase = NULL; bases->func = result.func; } return result.entry; } _dl_find_eh_frame is implemented by glibc and is very similar to __gnu_Unwind_Find_exidx, except that it returns the PT_GNU_EH_FRAME address. The real implementation of _Unwind_Find_FDE (not the simplified one above) can use a weak symbol for _dl_find_eh_frame and has still the old dl_iterate_phdr-based implementation as a fallback (for non-glibc older glibc). There's also some hackery for passing around the dbase value for DW_EH_PE_datarel. I split the common tail of the old and new implementations into find_fde_tail, included below for reference. Previously, the tail was covered by the loader lock implied by dl_iterate_phdr. With the new implementation, that's no longer true. I do not think this matters because in the current implement in trunk, we continue DWARF parsing after _Unwind_Find_FDE returns, relative to the returned FDE pointer. A concurrent dlclose will proceed to unmap the data while it is being read. So the race condition already exists on trunk, and we can have crashes due to concurrent dlclose of code for a stack frame that is processed by unwinding. The new implementation I have is as fast as the trunk implementation for the best case for the trunk implementation: when there are no threads and we achieve 100% cache hit rate. Even with 1000 shared objects (of which less than 8 participate in unwinding, to achieve 100% cache hits), my new implementation seems to be 10% faster. Once the cache isn't 100% effective, performance is really bad on the trunk, and likewise for multi-threading. (There are no writes to shared data in my new implementation and only two 64-bit acquire-MO loads (worst case), so benchmarking multi-threaded workloads against the trunk implementation isn't very interesting. 32-bit support will likely need 4 acquire-MO loads.) There is a slight performance regression with lots and lots of dlopen calls because the new data structures for the EH frame location are somewhat costly to upgrade (but dlopen still spends twice as much time in strcmp for soname matching, so I'm not worried). There's room for one more straightforward optimization: move _Unwind_Find_FDE into glibc, and not just its PT_GNU_EH_FRAME lookup. We would have to teach glibc a little bit about the PT_GNU_EH_FRAME layout (the parts parsed in find_fde_tail below). Once in glibc, we can easily cache the presence and location of the table of FDEs used in the binary search inside find_fde_tail. This means we can start the search right away. But maybe this isn't necessary given that my new implementation is already as fast as the old one even in the cases most benefiting the old implementation. A future implementation of _Unwind_Find_FDE inside glibc could also cache the PC-to-FDE mapping inside glibc, skipping the binary search for the FDE and taking advantage of efficient cache invalidation during dlopen and dlclose. For an object of 20,000 functions, the binary search in find_fde_tail takes about 10% to 15% of the time during unwinding, so it's not insignificant. But it's also unclear if we can actually a achieve a decent cache hit rate for any useful cache size. So I would like to stick to my current approach, with a function like _dl_find_eh_frame. I will hopefully complete 32-bit support and tests next week, so that I have patches to post. Comments? Thoughts? Thanks, Florian /* Result type of find_fde_tail below. */ struct find_fde_tail_result { const fde *entry; void *func; }; /* Find the FDE for the program counter PC, in a previously located PT_GNU_EH_FRAME data region. */ static struct find_fde_tail_result find_fde_tail (_Unwind_Ptr pc, const struct unw_eh_frame_hdr *hdr, _Unwind_Ptr dbase) { const unsigned char *p = (const unsigned char *) (hdr + 1); _Unwind_Ptr eh_frame; struct object ob; if (hdr->version != 1) return (struct find_fde_tail_result) { NULL, }; p = read_encoded_value_with_base (hdr->eh_frame_ptr_enc, base_from_cb_data (hdr->eh_frame_ptr_enc, dbase), p, &eh_frame); /* We require here specific table encoding to speed things up. Also, DW_EH_PE_datarel here means using PT_GNU_EH_FRAME start as base, not the processor specific DW_EH_PE_datarel. */ if (hdr->fde_count_enc != DW_EH_PE_omit && hdr->table_enc == (DW_EH_PE_datarel | DW_EH_PE_sdata4)) { _Unwind_Ptr fde_count; p = read_encoded_value_with_base (hdr->fde_count_enc, base_from_cb_data (hdr->fde_count_enc, dbase), p, &fde_count); /* Shouldn't happen. */ if (fde_count == 0) return (struct find_fde_tail_result) { NULL, }; if ((((_Unwind_Ptr) p) & 3) == 0) { struct fde_table { signed initial_loc __attribute__ ((mode (SI))); signed fde __attribute__ ((mode (SI))); }; const struct fde_table *table = (const struct fde_table *) p; size_t lo, hi, mid; _Unwind_Ptr data_base = (_Unwind_Ptr) hdr; fde *f; unsigned int f_enc, f_enc_size; _Unwind_Ptr range; mid = fde_count - 1; if (pc < table[0].initial_loc + data_base) return (struct find_fde_tail_result) { NULL, }; else if (pc < table[mid].initial_loc + data_base) { lo = 0; hi = mid; while (lo < hi) { mid = (lo + hi) / 2; if (pc < table[mid].initial_loc + data_base) hi = mid; else if (pc >= table[mid + 1].initial_loc + data_base) lo = mid + 1; else break; } gcc_assert (lo < hi); } f = (fde *) (table[mid].fde + data_base); f_enc = get_fde_encoding (f); f_enc_size = size_of_encoded_value (f_enc); read_encoded_value_with_base (f_enc & 0x0f, 0, &f->pc_begin[f_enc_size], &range); void *func = (void *) (table[mid].initial_loc + data_base); if (pc < table[mid].initial_loc + data_base + range) return (struct find_fde_tail_result) { f, func }; else return (struct find_fde_tail_result) { NULL, func }; } } /* We have no sorted search table, so need to go the slow way. As soon as GLIBC will provide API so to notify that a library has been removed, we could cache this (and thus use search_object). */ ob.pc_begin = NULL; ob.tbase = NULL; ob.dbase = (void *) dbase; ob.u.single = (fde *) eh_frame; ob.s.i = 0; ob.s.b.mixed_encoding = 1; /* Need to assume worst case. */ const fde *entry = linear_search_fdes (&ob, (fde *) eh_frame, (void *) pc); if (entry != NULL) { _Unwind_Ptr func; unsigned int encoding = get_fde_encoding (entry); read_encoded_value_with_base (encoding, base_from_cb_data (encoding, dbase), entry->pc_begin, &func); return (struct find_fde_tail_result) { entry, (void *) func }; } return (struct find_fde_tail_result) { NULL, }; }