public inbox for gnu-gabi@sourceware.org
* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                               ` H.J. Lu
@ 2017-01-01  0:00                                 ` Suprateeka R Hegde
  2017-01-01  0:00                                   ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: Suprateeka R Hegde @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Carlos O'Donell, gnu-gabi

On 03/27/17 21:45, H.J. Lu wrote:
> 
> There is a way to support GNU_MBIND segments without the glibc changes.
> Instead, dl_iterate_phdr
> 
> int dl_iterate_phdr (int (*callback) (struct dl_phdr_info *info,
>                                       size_t size, void *data),
>                      void *data);
> 
> is called via the .init_array section to process GNU_MBIND segments in
> executable and shared objects:
> 
> static int
> callback (struct dl_phdr_info *info, size_t size, void *data)
> {
>   Compute the load address of the current module.
>   if info->dlpi_addr == the load address of the current module
>     {
>       check ELF program headers and process GNU_MBIND segments
>       return 1;
>     }
> 
>   return 0;
> }
> 
> static void
> call_gnu_mbind_setup (void)
> {
>   dl_iterate_phdr (callback, NULL);
> }
> 
> static void (*init_array) (void)
>  __attribute__ ((section (".init_array"), used))
>  = &call_gnu_mbind_setup;
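The callback above is pseudo code. Below is one possible compilable version, assuming glibc on Linux. It identifies the "current module" by checking which object's PT_LOAD segment contains a static variable defined in the same file. That test, the counter mbind_segments_seen, and reading the memory type as an offset from PT_GNU_MBIND_LO are assumptions made for illustration. The PT_GNU_MBIND_* values mirror binutils' include/elf/common.h and are defined here only if <elf.h> lacks them.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, mirroring binutils' include/elf/common.h; older <elf.h>
   headers do not define them.  */
#ifndef PT_GNU_MBIND_LO
# define PT_GNU_MBIND_NUM 4096
# define PT_GNU_MBIND_LO (PT_LOOS + 0x474e555)
# define PT_GNU_MBIND_HI (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)
#endif

/* Any object defined in this module: the module whose PT_LOAD segment
   contains its address is the current module.  */
static const char module_anchor = 0;

/* Hypothetical result: number of GNU_MBIND segments found in this module,
   or -1 if the module was never identified.  */
static int mbind_segments_seen = -1;

static int
callback (struct dl_phdr_info *info, size_t size, void *data)
{
  (void) size;
  (void) data;
  uintptr_t anchor = (uintptr_t) &module_anchor;
  int is_current = 0;

  /* Compute whether this object is the current module.  */
  for (ElfW(Half) i = 0; i < info->dlpi_phnum; i++)
    {
      const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
      uintptr_t start = info->dlpi_addr + ph->p_vaddr;
      if (ph->p_type == PT_LOAD
          && anchor >= start && anchor < start + ph->p_memsz)
        is_current = 1;
    }
  if (!is_current)
    return 0;                   /* Not us: keep iterating.  */

  /* Check ELF program headers and process GNU_MBIND segments.  */
  int count = 0;
  for (ElfW(Half) i = 0; i < info->dlpi_phnum; i++)
    {
      const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
      if (ph->p_type >= PT_GNU_MBIND_LO && ph->p_type <= PT_GNU_MBIND_HI)
        {
          /* Assumed: the memory type is the offset from PT_GNU_MBIND_LO.
             A real implementation would call its setup routine here with
             type, addr and ph->p_memsz.  */
          unsigned int type = ph->p_type - PT_GNU_MBIND_LO;
          void *addr = (void *) (info->dlpi_addr + ph->p_vaddr);
          (void) type;
          (void) addr;
          count++;
        }
    }
  mbind_segments_seen = count;
  return 1;                     /* Done: stop iterating.  */
}

static void
call_gnu_mbind_setup (void)
{
  dl_iterate_phdr (callback, NULL);
}

/* Run call_gnu_mbind_setup at load time via .init_array.  */
static void (*init_array) (void)
  __attribute__ ((section (".init_array"), used)) = &call_gnu_mbind_setup;
```

In an ordinary build with no GNU_MBIND sections, mbind_segments_seen is 0 by the time main runs. Note that dl_iterate_phdr(3) is a GNU/BSD extension rather than a POSIX interface.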

This looks ideal and matches my requirement too. Are you suggesting
this dl_iterate_phdr(3) approach in your proposal instead of
__gnu_mbind_setup?

Or are you suggesting that all implementations that need different
arguments from __gnu_mbind_setup_v1 (like my NVM one) should use this
dl_iterate_phdr(3) approach?

I am OK either way.

However, I think your earlier approach, __gnu_mbind_setup, is better
when shared libraries with GNU_MBIND segments are dlopen'ed: they don't
have to iterate over all loaded objects again to reach their PHDRs. What
is the recommendation for such dlopen'ed libraries?

Also, since dl_iterate_phdr(3) is not part of any standard, it may
change in an incompatible way in the future.

--
Supra

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00 ` Suprateeka R Hegde
@ 2017-01-01  0:00   ` H.J. Lu
  2017-01-01  0:00     ` Suprateeka R Hegde
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Suprateeka R Hegde; +Cc: gnu-gabi

On Thu, Mar 2, 2017 at 7:16 AM, Suprateeka R Hegde
<hegdesmailbox@gmail.com> wrote:
> On 23-Feb-2017 09:49 PM, H.J. Lu wrote:
>>
>>  The default implementation of __gnu_mbind_setup is
>>
>> int
>> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
>> {
>>   return 0;
>> }
>>
>> which can be overridden by a different implementation at link-time.
>>
>
> Since this is a design that allows vendor-specific extensions and
> implementations, would it be OK if we make it more generic?

Yes.

> Instead of a fixed 3 arguments (type, addr, len), how about something like a
> pointer to a generic MBIND_CONTEXT struct (say of type __gnu_mbind_context
> defined)?  And let the implementation define the actual struct.

We can add more arguments.  But they must be predefined since
__gnu_mbind_setup is called from ld.so which must know what to
pass to __gnu_mbind_setup.

> I would like to handle NVM/NVMe (long back I had mentioned about
> PT_PERSISTENT) through this MBIND and my implementation of handling NVM/NVMe
> needs more data to be passed to such "setup" functions.

I call it MBIND since a MBIND segment is inside a LOAD segment and
my real __gnu_mbind_setup will call mbind to move a MBIND region to
a NUMA node after it has been loaded and relocated. We can give it
a different name if you have a better one.

> Or should this __gnu_mbind_setup be considered a very basic /
> fundamental function (used just to set up the "memory area"), with
> implementations/vendors expected to write wrapper/handler functions to
> handle other aspects of the special memory? In that case the fixed set of
> basic args looks OK.

That is correct.  __gnu_mbind_setup is platform specific.  We can pass as
much as we need to __gnu_mbind_setup.  But they have to be fixed.

> IMHO this __gnu_mbind_setup is a very good design to be generic enough and
> not be very specific/basic/fundamental runtime support.
>

Thanks.

-- 
H.J.

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                   ` H.J. Lu
  2017-01-01  0:00                     ` Florian Weimer
@ 2017-01-01  0:00                     ` Suprateeka R Hegde
  2017-01-01  0:00                       ` H.J. Lu
  1 sibling, 1 reply; 33+ messages in thread
From: Suprateeka R Hegde @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Carlos O'Donell, gnu-gabi

On 16-Mar-2017 03:33 AM, H.J. Lu wrote:
> Run-time support
> 
> int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);

I still think the above interface is specific to MCDRAM proposal and is
not generic enough for other special memory types. I would prefer it
more in the form of something I showed.

Or something even better than what I showed for the ABI support
documentation and hence glibc implementation.

> 
> It sets up a special memory area of 'type' and 'length' at 'addr', where
> 'addr' is a multiple of the page size.  It returns zero on success, a
> positive errno value for a non-fatal error and a negative errno value for
> a fatal error.

Let me rephrase what I am asking. Assume my interface of
__gnu_mbind_setup is as follows:

int __gnu_mbind_setup (unsigned int type, __nvm_kmem_t *nvm_kmem_obj,
                       size_t length);

Then what is the expected way to add the default implementation for
this vendor-specific interface in glibc? It differs from the
interface seen in glibc (from your patch).

Are you saying we should use #ifdef for the default implementation of every
vendor, and hence a vendor-specific ld.so? I don't think this is what you
meant, as this has a lot of obvious problems.

Or are you saying that we do not add vendor specific default
implementations at all in glibc and just keep one interface and one
default implementation that you have mentioned? And then the real
implementation (with a different interface) would override glibc one
(with the interface you have defined)?

If that is what you are designing, it looks like the overriding is
purely based on the symbol name and not the full interface of the
function. Personally I feel this is an overriding hack based on
linker/loader symbol resolution magic.

Since ld.so is not meant only for programs with C-style linkage, what if
the real implementation library is written in C++ and wants to export
only mangled names (interfaces) without any extern "C" kludge? Or is
this considered to be a standard C library call just like mmap etc.?

And you may also want to define the flow for fully archive bound static
binaries.

Assuming I am not missing anything above, if you still want to keep the
interface as defined by you currently, I am OK with that. But we should
at least add a couple of lines in addition, something like:

"...which can be overridden by a different implementation at link-time.
Such an implementation is required to provide a C style (unmangled)
__gnu_mbind_setup function. However, the arguments and return type of
the function need not match the one defined here"

BTW, what if the real implementation library also includes stddef.h?
Won't there be a prototype difference if the vendor
implementation/interface is different?

--
Supra

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00         ` Florian Weimer
@ 2017-01-01  0:00           ` H.J. Lu
  0 siblings, 0 replies; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Carlos O'Donell, gnu-gabi

On Wed, Mar 1, 2017 at 8:49 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 02/28/2017 06:03 PM, H.J. Lu wrote:
>>
>> On Tue, Feb 28, 2017 at 8:19 AM, Carlos O'Donell <carlos@redhat.com>
>> wrote:
>>>
>>> On 02/23/2017 09:59 PM, H.J. Lu wrote:
>>>>>
>>>>> Why does it run _after_ all shared objects and the executable file are
>>>>> loaded?
>>>>
>>>>
>>>> Since __gnu_mbind_setup may call any external functions, it can only
>>>> be done after everything is loaded and relocated.
>>>
>>>
>>> Who defines this function?
>>
>>
>> Platform vendors with special memory support should provide such a function.
>>
>>> Where is it implemented?
>>
>>
>> We are working on libmbind to implement it.
>
>
> That's backwards.  Either we need to merge libmbind in to glibc, or this
> should be something provided by the kernel vDSO.

I don't think libnuma belongs in glibc or in the kernel.

> We certainly don't want to repeat the mistake with the unwinder and
> libgcc_s.

__gnu_mbind_setup is kind of like malloc.  The default __gnu_mbind_setup
in glibc just returns 0, which can be overridden by the one from libmbind.

>>> Why can't this be run in a constructor? Is that too late?
>>
>>
>> We can use MCDRAM for dynamically allocated memory with
>> memkind.  We are looking for a user-friendly way to use MCDRAM
>> for normal data variables.
>
>
> Is it really necessary to avoid the pointer indirection?

Yes.  We found that memkind wasn't sufficient.

-- 
H.J.

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                     ` Florian Weimer
@ 2017-01-01  0:00                       ` H.J. Lu
  2017-01-01  0:00                         ` Florian Weimer
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Suprateeka R Hegde, Carlos O'Donell, gnu-gabi

On Thu, Mar 16, 2017 at 1:40 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 03/15/2017 11:03 PM, H.J. Lu wrote:
>>
>> After all shared objects and the executable file have been loaded and
>> relocations have been processed, the run-time loader calls
>> __gnu_mbind_setup with the type, address and length of each GNU_MBIND
>> segment in a shared object or the executable file.  The default
>> implementation of __gnu_mbind_setup is
>
>
> Is there a specified invocation order for the segments?
>
> Does the call happen immediately after relocations for an object are
> processed, or only after relocations for all objects are processed?
>
> If the latter, why can't you use the existing ELF constructor mechanism for
> this?  As far as I understand it, the call to __gnu_mbind_setup would just
> happen before the constructor calls.

That is correct.  The issue is accessing the ELF segment headers of each
loaded object only once.  There is no good way to get this information
from a constructor.

-- 
H.J.

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00   ` H.J. Lu
@ 2017-01-01  0:00     ` Suprateeka R Hegde
  2017-01-01  0:00       ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: Suprateeka R Hegde @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu; +Cc: gnu-gabi

On 02-Mar-2017 09:34 PM, H.J. Lu wrote:
> On Thu, Mar 2, 2017 at 7:16 AM, Suprateeka R Hegde
> <hegdesmailbox@gmail.com> wrote:
>> On 23-Feb-2017 09:49 PM, H.J. Lu wrote:
>>>
>>>  The default implementation of __gnu_mbind_setup is
>>>
>>> int
>>> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
>>> {
>>>   return 0;
>>> }
>>>
>>> which can be overridden by a different implementation at link-time.
>>>
>>
>> Since this is a design that allows vendor specific extension and
>> implementation, would it OK if we make it more generic?
>
> Yes.
>
>> Instead of a fixed 3 arguments (type, addr, len), how about something like a
>> pointer to a generic MBIND_CONTEXT struct (say of type __gnu_mbind_context
>> defined)?  And let the implementation define the actual struct.
>
> We can add more arguments.  But they must be predefined since
> __gnu_mbind_setup is called from ld.so which must know what to
> pass to __gnu_mbind_setup.

<snip>

> __gnu_mbind_setup is platform specific.  We can pass as
> much as we need to __gnu_mbind_setup.  But they have to be fixed.

I didn't understand this. Predefined/fixed where and how? As part of the
ABI support? Or as part of the implementation?

If it is going to be part of the ABI, then we need to finalize and define
it now. If it is part of the implementation, then __gnu_mbind_setup must
be documented in the ABI just as a guideline to the implementer.

What I meant by being generic is flexibility in the number of
arguments. Here is incomplete pseudo code showing what I mean:

enum __gnu_mbind_instance {
    GNU_MBIND_DEFAULT,
    GNU_MBIND_OVERRIDDEN
};

typedef struct {
    enum __gnu_mbind_instance mb_inst; // mandatory. ABI specified
#ifdef VENDOR_1
    type1 identifier1; // optional implementation defined
    type2 identifier2; // optional implementation defined
    ...
    typeN identifierN; // optional implementation defined
#endif
// Add vendor/implementation as necessary
} __gnu_mbind_context;

int __gnu_mbind_setup(__gnu_mbind_context*);

Now, for the MCDRAM instance, as I understand it:

typedef struct {
    enum __gnu_mbind_instance mb_inst;
#ifdef VENDOR_MCDRAM
    unsigned int type;
    void *addr;
    size_t length;
#endif
} __gnu_mbind_context;

int
__gnu_mbind_setup (__gnu_mbind_context* mbind_context)
{
    /* Even if a real implementation exists, check and allow disabling
       special memory bindings.  */
    if (mbind_context->mb_inst == GNU_MBIND_DEFAULT) {
       return 0;
    }
    else {
       // return real_implementation_result;
    }
}

If that is not what you mean, I would like to understand how to use it
in a vendor-specific way.
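For comparison, here is one way to turn the pseudo code above into compilable C, using the common "ABI header as first member" pattern instead of per-vendor #ifdefs. The ctx_size field, struct nvm_mbind_context and nvm_handle are hypothetical names, not part of any proposal. The sketch leaves open the question raised in this thread: ld.so would still need some way to obtain and fill in the vendor-specific part.

```c
#include <errno.h>
#include <stddef.h>

enum __gnu_mbind_instance
{
  GNU_MBIND_DEFAULT,
  GNU_MBIND_OVERRIDDEN
};

/* ABI-specified header that every context begins with.  */
typedef struct
{
  enum __gnu_mbind_instance mb_inst;
  size_t ctx_size;          /* Hypothetical: size of the full context.  */
} __gnu_mbind_context;

/* Hypothetical vendor context; the ABI header must be the first member
   so a pointer to it can be converted back to the full structure.  */
struct nvm_mbind_context
{
  __gnu_mbind_context hdr;
  void *addr;
  size_t length;
  int nvm_handle;           /* Stand-in for vendor-specific data.  */
};

/* Vendor implementation: only it knows the full context layout.  */
int
__gnu_mbind_setup (__gnu_mbind_context *ctx)
{
  if (ctx->mb_inst == GNU_MBIND_DEFAULT)
    return 0;               /* Special memory binding disabled.  */
  if (ctx->ctx_size < sizeof (struct nvm_mbind_context))
    return EINVAL;          /* Non-fatal: unexpected context layout.  */

  struct nvm_mbind_context *nvm = (struct nvm_mbind_context *) ctx;
  /* A real implementation would bind [addr, addr + length) using
     nvm_handle; here we only validate the request.  */
  return nvm->length == 0 ? EINVAL : 0;
}
```

The ctx_size check lets the implementation reject a context whose layout it does not recognize instead of reading past its end.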


>> I would like to handle NVM/NVMe (long back I had mentioned about
>> PT_PERSISTENT) through this MBIND and my implementation of handling NVM/NVMe
>> needs more data to be passed to such "setup" functions.
>
> I call it MBIND since a MBIND segment is inside a LOAD segment and
> my real __gnu_mbind_setup will call mbind to move a MBIND region to
> a NUMA node after it has been loaded and relocated. We can give it
> a different name if you have a better one.

I don't have any comment on the MBIND naming. It sounds good.

--
Supra

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00           ` Suprateeka R Hegde
@ 2017-01-01  0:00             ` H.J. Lu
  2017-01-01  0:00               ` Suprateeka R Hegde
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Suprateeka R Hegde; +Cc: Carlos O'Donell, gnu-gabi

On Mon, Mar 6, 2017 at 5:25 AM, Suprateeka R Hegde
<hegdesmailbox@gmail.com> wrote:
> On 04-Mar-2017 07:37 AM, Carlos O'Donell wrote:
>>
>> On 03/03/2017 11:00 AM, H.J. Lu wrote:
>>>
>>> __gnu_mbind_setup is called from ld.so.  Since there is only one ld.so,
>>> it needs to know what to pass to __gnu_mbind_setup.  Not all arguments
>>> have to be used by all implementations nor all memory types.
>>
>>
>> I think what Supra is suggesting is a pointer-to-implementation interface
>> which would allow ld.so to pass completely different arguments to the
>> library depending on what kind of memory is being defined by the sh_info
>> value. It avoids needing to encode all the types in the API, and just
>> uses an incomplete pointer to the type.
>
>
> That's absolutely right.
>
> However, I am not suggesting one is better than the other. I just want to
> get clarity on what the code looks like for different implementations.
>
> On 03-Mar-2017 09:30 PM, H.J. Lu wrote:
>>
>> __gnu_mbind_setup is called from ld.so.  Since there is only one ld.so,
>> it needs to know what to pass to __gnu_mbind_setup.
>
>
> So I want to know what is that ONE-FIXED-FORM of __gnu_mbind_setup being
> called by ld.so.
>
>>  Not all arguments
>> have to be used by all implementations nor all memory types.
>
>
> I think I am still not getting this. Really sorry for that. Would it be
> possible for you to write a small piece of pseudo code that depicts what
> this design looks like for different implementations?
>

For my usage, I only want to know memory type, address and its size:

#define _GNU_SOURCE
#include <unistd.h>
#include <errno.h>
#include <stdint.h>
#include <cpuid.h>
#include <numa.h>
#include <numaif.h>
#include <mbind.h>

#ifdef LIBMBIND_DEBUG
#include <stdio.h>
#endif

/* High-Bandwidth Memory node mask.  */
static struct bitmask *hbw_node_mask;

/* Initialize High-Bandwidth Memory node mask.  This must be called before
   __gnu_mbind_setup.  */
static void
__attribute__ ((used, constructor))
init_node_mask (void)
{
  if (__get_cpuid_max (0, 0) == 0)
    return;

  /* Check if vendor is Intel.  */
  uint32_t eax, ebx, ecx, edx;
  __cpuid (0, eax, ebx, ecx, edx);
  if (!(ebx == 0x756e6547 && ecx == 0x6c65746e && edx == 0x49656e69))
    return;

  /* Get family and model.  */
  uint32_t model;
  uint32_t family;
  __cpuid (1, eax, ebx, ecx, edx);
  family = (eax >> 8) & 0x0f;
  if (family != 0x6)
    return;
  model = (eax >> 4) & 0x0f;
  model += (eax >> 12) & 0xf0;

  /* Check for KNL and KNM.  */
  switch (model)
    {
    default:
      return;

    case 0x57: /* Knights Landing.  */
    case 0x85: /* Knights Mill.  */
      break;
    }

  /* Check if NUMA configuration is supported.  */
  int nodes_num = numa_num_configured_nodes ();
  if (nodes_num < 2)
    return;

  /* Get MCDRAM NUMA nodes.  */
  struct bitmask *node_mask = numa_allocate_nodemask ();
  struct bitmask *node_cpu = numa_allocate_cpumask ();

  int i;
  for (i = 0; i < nodes_num; i++)
    {
      numa_node_to_cpus (i, node_cpu);
      /* NUMA node without CPU is MCDRAM node.  */
      if (numa_bitmask_weight (node_cpu) == 0)
        numa_bitmask_setbit (node_mask, i);
    }

  if (numa_bitmask_weight (node_mask) != 0)
    {
      /* On Knights Landing and Knights Mill, MCDRAM is High-Bandwidth
         Memory.  */
      hbw_node_mask = node_mask;
    }
  else
    numa_bitmask_free (node_mask);
  numa_bitmask_free (node_cpu);
}

/* Support all different memory types.  */

static int
mbind_setup (unsigned int type, void *addr, size_t length,
             unsigned int mode, unsigned int flags)
{
  int err = ENXIO;

  switch (type)
    {
    default:
#ifdef LIBMBIND_DEBUG
      /* %u for the unsigned type, %zu for the size_t length.  */
      printf ("Unsupported mbind type %u: from %p of size %zu\n",
              type, addr, length);
#endif
      return EINVAL;

    case GNU_MBIND_HBW:
      if (hbw_node_mask)
        err = mbind (addr, length, mode, hbw_node_mask->maskp,
                     hbw_node_mask->size, flags);
      break;
    }

  if (err < 0)
    err = errno;

#ifdef LIBMBIND_DEBUG
  printf ("Mbind type %u: from %p of size %zu\n", type, addr, length);
#endif

  return err;
}

int
__gnu_mbind_setup (unsigned int type, void *addr, size_t length)
{
  return mbind_setup (type, addr, length, MPOL_BIND, MPOL_MF_MOVE);
}

If other memory types need additional information, they can be
passed to __gnu_mbind_setup.  We just need to know what
information is needed.


-- 
H.J.

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                         ` Suprateeka R Hegde
@ 2017-01-01  0:00                           ` H.J. Lu
  2017-01-01  0:00                             ` Suprateeka R Hegde
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Suprateeka R Hegde; +Cc: Carlos O'Donell, gnu-gabi

On Fri, Mar 17, 2017 at 11:12 AM, Suprateeka R Hegde
<hegdesmailbox@gmail.com> wrote:
> On Friday 17 March 2017 02:55 AM, H.J. Lu wrote:
>>> Since ld.so is not meant only for programs with C style linkage, what if
>>> the real implementation library is written in C++ and wants to export
>>> only mangled names (interfaces) without any extern "C" kludge? Or is
>>> this considered to be a standard C library call just like mmap etc.?
>>
>> Only the __gnu_mbind_setup symbol is used.  We can change the
>> second argument to "void *data" and make it dependent on memory
>> type.  But to support a new memory type, we have to update ld.so.  I'd
>> like to use the same ld.so binary to support any memory types even if
>> it means that we need to pass info to __gnu_mbind_setup which isn't
>> used by all memory types.
>
> Ah! Now I understand the design completely (I think). Looks like Carlos
> understood this quite earlier in the discussion.
>
> You are saying that the interface -
>
> int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);
>
> - is fixed in ld.so and also in the real implementation library. And,
> the real implementation in turn calls the actual-real-implementation, as
> shown in your libmbind code:
>
> int
> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
> {
>   // in turn calls actual implementation
>   return vendor_specific_mbind_setup (vendor specific types);
> }
>
>
> All this while, based on the current description, I was under the
> impression that your design allows the __gnu_mbind_setup interface itself
> to be overridden in the real implementation, something like:
>
> int
> __gnu_mbind_setup (__nvm_kmem_t *nvm_obj, void *nvm_handle)
> {
>   // actual implementation directly here in the body
> }
>
> So I was wondering how that would work, and hence most of my points were
> out of phase.
>
>>  The question is what the possible info needed
>> for all memory types is.
>
> That's too much to predict right now. And the current interface you
> defined also does not seem to be generic. For instance, my NVM
> implementation, though not complete, needs a totally different set of
> arguments. So going by the current design, I will have to use
> __gnu_mbind_setup (unsigned int type, void *addr, size_t length) just to
> call my real setup, without using any of the arguments passed by ld.so.
>
> Assuming I am in sync with you now, I would say that the pseudo code I
> showed earlier works for you, for me and for anybody else. In other
> words, it is more generic.
>
> With that approach, there is
>
> 1. No need to update ld.so every time for every new mem type
> 2. No need to know all possible info needed for all mem types
> 3. No need to encode all types in the API (as Carlos said)
>
> We just use a pointer-to-implementation interface, the struct
> __gnu_mbind_context that I showed. And we can have a default struct
> instantiated in ld.so and a global pointer pointing to that. And later
> the global pointer can be made to point to the vendor specific struct,
> before ld.so actually calls __gnu_mbind_setup, thereby completing a
> successful override (if necessary, that is when special memory types are
> in use).
>
> Or similar mechanisms to override default struct instantiated in ld.so.
> There are many well known ways to override the default struct as we all
> know.
>
> Personally I think this would be a better way to provide the ABI support
> in a generic way.

ld.so needs to call the real __gnu_mbind_setup implementation
with the correct arguments.  We can keep it as is and add a
new one, __gnu_mbind_setup_v2, if needed.
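The versioning idea can be sketched in ordinary C with a weak reference: the caller uses a newer entry point when some object provides it and falls back to the three-argument one otherwise. The __gnu_mbind_setup_v2 signature, with its extra void *data argument, is hypothetical (nothing in this thread defines it), and a real ld.so would more likely look the symbol up among the loaded objects than rely on link-time weak references.

```c
#include <stddef.h>

/* Hypothetical v2 entry point; declared as a weak reference, so its
   address is NULL when no linked object defines it.  */
extern int __gnu_mbind_setup_v2 (unsigned int type, void *addr,
                                 size_t length, void *data)
  __attribute__ ((weak));

/* Weak default for the three-argument entry point.  */
__attribute__ ((weak)) int
__gnu_mbind_setup_v1 (unsigned int type, void *addr, size_t length)
{
  (void) type;
  (void) addr;
  (void) length;
  return 0;
}

/* Caller-side dispatch: prefer v2 when some object provides it.  */
static int
call_mbind_setup (unsigned int type, void *addr, size_t length, void *data)
{
  if (__gnu_mbind_setup_v2 != NULL)
    return __gnu_mbind_setup_v2 (type, addr, length, data);
  return __gnu_mbind_setup_v1 (type, addr, length);
}
```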

> That said, I am OK to live with minor kludges and we can keep the design
> as is.
>
>>
>>> And you may also want to define the flow for fully archive bound static
>>> binaries.
>>
>> For static executable, __gnu_mbind_setup will be called on all MBIND
>> segments before constructors are called.  __gnu_mbind_setup in libc.a
>> is weak and will be overridden by the real one in libmbind.a.
>
> Let's add this to the ABI support document as well.
>

How about this:

Run-time support

int __gnu_mbind_setup_v1 (unsigned int type, void *addr, size_t length);

It sets up a special memory area of 'type' and 'length' at 'addr', where
'addr' is a multiple of the page size.  It returns zero on success, a
positive errno value for a non-fatal error and a negative errno value for
a fatal error.

After all shared objects and the executable file have been loaded and
relocations have been processed, the run-time loader calls
__gnu_mbind_setup_v1 with the type, address and length of each GNU_MBIND
segment in a shared object or the executable file.  If
__gnu_mbind_setup_v1 must be defined in the run-time loader, it should be
implemented as a weak function:

__attribute__ ((weak)) int
__gnu_mbind_setup_v1 (unsigned int type, void *addr, size_t length)
{
  return 0;
}

in the run-time loader so that the GNU_MBIND run-time library isn't required
by normal executables or shared objects.  The real implementation of
__gnu_mbind_setup_v1 should be in the GNU_MBIND run-time library and
override the weak one in the run-time loader.
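To make the weak default and the return-value convention concrete, here is a small sketch in ordinary C. struct mbind_segment and process_mbind_segments are hypothetical stand-ins for the loader's internal bookkeeping. What the loader should do on a fatal (negative) return is not specified above, so this sketch simply hands the error back to its caller.

```c
#include <stddef.h>
#include <stdio.h>

/* Weak default: a strong definition in a GNU_MBIND run-time library
   linked into the program takes precedence.  */
__attribute__ ((weak)) int
__gnu_mbind_setup_v1 (unsigned int type, void *addr, size_t length)
{
  (void) type;
  (void) addr;
  (void) length;
  return 0;
}

/* Hypothetical loader-side record for one GNU_MBIND segment.  */
struct mbind_segment
{
  unsigned int type;
  void *addr;      /* Page-aligned start, already loaded and relocated.  */
  size_t length;
};

/* Hypothetical loader loop applying the return-value convention:
   0 = success, > 0 = non-fatal errno (warn and continue),
   < 0 = fatal -errno (stop and return the error to the caller).  */
static int
process_mbind_segments (const struct mbind_segment *segs, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      int ret = __gnu_mbind_setup_v1 (segs[i].type, segs[i].addr,
                                      segs[i].length);
      if (ret < 0)
        return ret;
      if (ret > 0)
        fprintf (stderr, "GNU_MBIND type %u at %p: non-fatal error %d\n",
                 segs[i].type, segs[i].addr, ret);
    }
  return 0;
}
```

Linking another object file that provides a strong __gnu_mbind_setup_v1 replaces the default without any change to process_mbind_segments.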



-- 
H.J.

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                         ` Florian Weimer
@ 2017-01-01  0:00                           ` H.J. Lu
  2017-01-01  0:00                             ` Florian Weimer
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Suprateeka R Hegde, Carlos O'Donell, gnu-gabi

On Mon, Mar 20, 2017 at 7:57 AM, Florian Weimer <fweimer@redhat.com> wrote:
> On 03/16/2017 07:22 PM, H.J. Lu wrote:
>
>>> If the latter, why can't you use the existing ELF constructor mechanism
>>> for
>>> this?  As far as I understand it, the call to __gnu_mbind_setup would
>>> just
>>> happen before the constructor calls.
>>
>>
>> That is correct.  The issue is accessing the ELF segment headers of each
>> loaded object only once.  There is no good way to get this information
>> from a constructor.
>
>
> I think you can get the data in a pretty straightforward manner using
> dlinfo.

dlinfo is used to get information from within an application.  I don't see
how it can be used here.

> I expect that libraries such as bdwgc might want to use the
> __gnu_mbind_setup callback as well, just to register freshly loaded shared

Did you mean to mark pieces of memory as garbage-collectible? I guess that
may work.

> objects and their data sections.  Can we make this work for multiple users?
>

What did you mean by "multiple users"?  My proposal targets the process
memory address space.  It doesn't forbid sharing memory addresses among
different processes.

-- 
H.J.

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                     ` Suprateeka R Hegde
@ 2017-01-01  0:00                       ` H.J. Lu
  2017-01-01  0:00                         ` Suprateeka R Hegde
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Suprateeka R Hegde; +Cc: Carlos O'Donell, gnu-gabi

On Thu, Mar 16, 2017 at 10:18 AM, Suprateeka R Hegde
<hegdesmailbox@gmail.com> wrote:
> On 16-Mar-2017 03:33 AM, H.J. Lu wrote:
>> Run-time support
>>
>> int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);
>
> I still think the above interface is specific to MCDRAM proposal and is
> not generic enough for other special memory types. I would prefer it
> more in the form of something I showed.
>
> Or something even better than what I showed for the ABI support
> documentation and hence glibc implementation.
>
>>
>> It sets up a special memory area of 'type' and 'length' at 'addr', where
>> 'addr' is a multiple of the page size.  It returns zero on success, a
>> positive errno value for a non-fatal error and a negative errno value for
>> a fatal error.
>
> Let me rephrase what I am asking. Assume my interface of
> __gnu_mbind_setup is as follows:
>
> int __gnu_mbind_setup (unsigned int type, __nvm_kmem_t *nvm_kmem_obj,
>                        size_t length);
>
> Then what is the expected way to add the default implementation for
> this vendor-specific interface in glibc? It differs from the
> interface seen in glibc (from your patch).
>
> Are you saying we should use #ifdef for the default implementation of every
> vendor, and hence a vendor-specific ld.so? I don't think this is what you
> meant, as this has a lot of obvious problems.
>
> Or are you saying that we do not add vendor specific default
> implementations at all in glibc and just keep one interface and one
> default implementation that you have mentioned? And then the real
> implementation (with a different interface) would override glibc one
> (with the interface you have defined)?

Yes, that is what I proposed.

> If that is what you are designing, it looks like the overriding is
> purely based on the symbol name and not the full interface of the
> function. Personally I feel this is an overriding hack based on
> linker/loader symbol resolution magic.

The goal is to link in the special memory run-time library only when
special memory is used.  Otherwise, every executable would be linked with
libmbind.

> Since ld.so is not meant only for programs with C style linkage, what if
> the real implementation library is written in C++ and wants to export
> only mangled names (interfaces) without any extern "C" kludge? Or is
> this considered to be a standard C library call just like mmap etc.?

Only the __gnu_mbind_setup symbol is used.  We can change the
second argument to "void *data" and make it dependent on memory
type.  But to support a new memory type, we have to update ld.so.  I'd
like to use the same ld.so binary to support any memory types even if
it means that we need to pass info to __gnu_mbind_setup which isn't
used by all memory types.  The question is what information all memory
types might need.

> And you may also want to define the flow for fully archive bound static
> binaries.

For static executable, __gnu_mbind_setup will be called on all MBIND
segments before constructors are called.  __gnu_mbind_setup in libc.a
is weak and will be overridden by the real one in libmbind.a.

> Assuming I am not missing anything above, if you still want to keep the
> interface as defined by you currently, I am OK with that. But we should
> at least add a couple of lines in addition, something like:
>
> "...which can be overridden by a different implementation at link-time.
> Such an implementation is required to provide a C style (unmangled)
> __gnu_mbind_setup function. However, the arguments and return type of
> the function need not match the one defined here"

What do you have in mind?  You can't change the return type of the real
__gnu_mbind_setup.  Otherwise, ld.so won't work correctly.

> BTW, what if the real implementation library also includes stddef.h?
> Won't there be a prototype difference if the vendor
> implementation/interface is different?
>

The only type used is size_t.  It should be the same for everyone on
a given platform.

-- 
H.J.

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00   ` H.J. Lu
@ 2017-01-01  0:00     ` Carlos O'Donell
  2017-01-01  0:00       ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: Carlos O'Donell @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu; +Cc: gnu-gabi

On 02/23/2017 09:59 PM, H.J. Lu wrote:
>> Why does it run _after_ all shared objects and the executable file are loaded?
> 
> Since __gnu_mbind_setup may call any external functions, it can only
> be done after everything is loaded and relocated.

Who defines this function?

Where is it implemented?

What does a typical implementation look like for MCDRAM use?

>> Why not let the dynamic loader choose when it needs to setup the memory?
> 
> 1. We want to be able to add support for new memory types by just
> updating the run-time library of __gnu_mbind_setup, instead of
> updating glibc.

Which library defines it?

Can two libraries define it? Does the dynamic loader run every DSO's
version of __gnu_mbind_setup?

> 2. Since __gnu_mbind_setup may depend on other libraries, we
> don't want a simple executable to require libfoo and libbar in addition
> to glibc, nor do we want to make libfoo and libbar part of glibc.

Why can't this be run in a constructor? Is that too late?

This seems like a specialized form of constructor that is guaranteed
to run before all other constructors?

>>> int
>>> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
>>> {
>>>   return 0;
>>> }
>>>
>>> which can be overridden by a different implementation at link-time.
>>
>> What if you _can't_ bind at ADDR?
> 
> It happens on systems without special memory.  __gnu_mbind_setup
> returns a positive value and ld.so keeps going.

Isn't this a violation of what the application binary requested?

This is a soft failure that the application doesn't know about.

Might this become a security issue if the application expected the
specific memory type?

>> What if the binding would work if ADDR was any value?
>>
> 
> GNU_MBIND isn't a LOAD segment,  similar to GNU_RELRO:
> 
> Program Headers:
>   Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
>   LOAD           0x000000 0x00000000 0x00000000 0x54624 0x54624 R E 0x1000
>   LOAD           0x054e9c 0x00055e9c 0x00055e9c 0x001b0 0x001b8 RW  0x1000
>   DYNAMIC        0x054eac 0x00055eac 0x00055eac 0x00110 0x00110 RW  0x4
>   NOTE           0x000114 0x00000114 0x00000114 0x00044 0x00044 R   0x4
>   GNU_EH_FRAME   0x048eb8 0x00048eb8 0x00048eb8 0x00ff4 0x00ff4 R   0x4
>   GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x10
>   GNU_RELRO      0x054e9c 0x00055e9c 0x00055e9c 0x00164 0x00164 R   0x1
> 
> ADDR contains the start of a memory region within the LOAD segment.

What are the constraints of GNU_MBIND then?

Is it required that it covers only the SHF_GNU_MBIND marked sections which
are part of a PT_LOAD segment?

-- 
Cheers,
Carlos.

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00 RFC: ABI support for special memory area H.J. Lu
@ 2017-01-01  0:00 ` Carlos O'Donell
  2017-01-01  0:00   ` H.J. Lu
  2017-01-01  0:00 ` Suprateeka R Hegde
  1 sibling, 1 reply; 33+ messages in thread
From: Carlos O'Donell @ 2017-01-01  0:00 UTC
  To: H.J. Lu, gnu-gabi

On 02/23/2017 11:19 AM, H.J. Lu wrote:
> A system may have MCDRAM or other types of memory in addition to
> normal RAM.  Here is an ABI proposal to allow placement in a section
> whose sh_info field indicates the special memory type.
> 
> Any comments?
> 
> H.J.
> ---
> To section attributes, add
> 
> #define SHF_GNU_MBIND     0x00100000
> 
> for sections used to place data or text into a special memory area.
> The section names should start with ".mbind" so that they won't be
> grouped together with normal sections by the link editor.  The sh_info
> field indicates the special memory type.
> 
> To the "Program Header" section, add an inclusive range of segment types
> for GNU_MBIND segments:
> 
> #define PT_GNU_MBIND_NUM    4096
> #define PT_GNU_MBIND_LO     (PT_LOOS + 0x474e555)
> #define PT_GNU_MBIND_HI     (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)
> 
> The array element specifies the location and size of a special memory area.
> Each GNU_MBIND segment contains one GNU_MBIND section and the segment
> type is PT_GNU_MBIND_LO plus the sh_info value.  If the sh_info value is
> greater than or equal to PT_GNU_MBIND_NUM, no GNU_MBIND segment will be
> created.  Each GNU_MBIND segment must be aligned at a page boundary.  The
> interpretation of the special memory area information is
> implementation-dependent.  Implementations may ignore GNU_MBIND segments.
> 
> Run-time support
> 
> int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);
> 
> It sets up a special memory area of 'type' and 'length' at 'addr', where
> 'addr' is a multiple of the page size.  It returns zero for success, a
> positive ERRNO value for a non-fatal error, and a negative ERRNO value
> for a fatal error.
> 
> After all shared objects and the executable file are loaded and
> relocations are processed, the run-time loader calls __gnu_mbind_setup
> with type, address and length for each GNU_MBIND segment in a shared
> object or the executable file.  The default implementation of
> __gnu_mbind_setup is

Why does it run _after_ all shared objects and the executable file are loaded?

Why not let the dynamic loader choose when it needs to setup the memory?
 
> int
> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
> {
>   return 0;
> }
> 
> which can be overridden by a different implementation at link-time.

What if you _can't_ bind at ADDR?

What if the binding would work if ADDR was any value?

-- 
Cheers,
Carlos.

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00     ` Suprateeka R Hegde
@ 2017-01-01  0:00       ` H.J. Lu
  2017-01-01  0:00         ` Carlos O'Donell
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC
  To: Suprateeka R Hegde; +Cc: gnu-gabi

On Fri, Mar 3, 2017 at 4:28 AM, Suprateeka R Hegde
<hegdesmailbox@gmail.com> wrote:
> On 02-Mar-2017 09:34 PM, H.J. Lu wrote:
>>
>> On Thu, Mar 2, 2017 at 7:16 AM, Suprateeka R Hegde
>> <hegdesmailbox@gmail.com> wrote:
>>>
>>> On 23-Feb-2017 09:49 PM, H.J. Lu wrote:
>>>>
>>>>
>>>>  The default implementation of __gnu_mbind_setup is
>>>>
>>>> int
>>>> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
>>>> {
>>>>   return 0;
>>>> }
>>>>
>>>> which can be overridden by a different implementation at link-time.
>>>>
>>>
>>> Since this is a design that allows vendor specific extension and
>>> implementation, would it be OK if we make it more generic?
>>
>>
>> Yes.
>>
>>> Instead of 3 fixed arguments (type, addr, len), how about something
>>> like a pointer to a generic MBIND_CONTEXT struct (say of a defined
>>> type __gnu_mbind_context)?  And let the implementation define the
>>> actual struct.
>>
>>
>> We can add more arguments.  But they must be predefined since
>> __gnu_mbind_setup is called from ld.so which must know what to
>> pass to __gnu_mbind_setup.
>
>
> <snip>
>
>> __gnu_mbind_setup is platform specific.  We can pass as
>> much as we need to __gnu_mbind_setup.  But they have to be fixed.
>
>
> I didn't understand this. Predefined/fixed where and how? As part of the ABI
> support? Or part of the implementation?

The interface is fixed so that it can be called from ld.so.  But its
implementation is platform specific.
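A minimal sketch of the caller side of such a fixed interface, assuming a loader that has already collected the relocated GNU_MBIND segments into an array. The struct mbind_seg and run_mbind_setup names are hypothetical. The weak default and the return-value rules (zero for success, positive errno non-fatal, negative errno fatal) follow the proposal text.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Default implementation as given in the proposal.  It is weak, so a
   real implementation linked into the program overrides it.  */
__attribute__ ((weak)) int
__gnu_mbind_setup (unsigned int type, void *addr, size_t length)
{
  (void) type;
  (void) addr;
  (void) length;
  return 0;
}

/* Hypothetical record for one relocated GNU_MBIND segment.  */
struct mbind_seg
{
  unsigned int type;   /* p_type - PT_GNU_MBIND_LO */
  void *addr;          /* load address + p_vaddr, page aligned */
  size_t length;       /* p_memsz */
};

/* Hypothetical loader-side driver: the same three arguments are passed
   for every memory type.  Returns the number of non-fatal failures and
   exits on a fatal one.  */
static int
run_mbind_setup (const char *name, const struct mbind_seg *segs, size_t n)
{
  int soft_failures = 0;
  for (size_t i = 0; i < n; i++)
    {
      int rc = __gnu_mbind_setup (segs[i].type, segs[i].addr,
                                  segs[i].length);
      if (rc < 0)
        {
          /* Negative errno: fatal, stop loading.  */
          fprintf (stderr, "%s: __gnu_mbind_setup failed: error %d\n",
                   name, -rc);
          exit (127);
        }
      if (rc > 0)
        soft_failures++;   /* Positive errno: non-fatal, keep going.  */
    }
  return soft_failures;
}
```

Because only the implementation behind the symbol varies, the caller never needs to know which memory types exist.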

> If it is going to be part of the ABI, then we need to finalize and define it now.
> If it is part of the implementation, then __gnu_mbind_setup must be
> documented in the ABI as just a guideline to the implementer.
>
> What I meant by being generic is to have flexibility in the number of
> arguments. Here is some incomplete pseudo code of what I am trying to say:
>
> enum __gnu_mbind_instance {
>    GNU_MBIND_DEFAULT,
>    GNU_MBIND_OVERRIDDEN
> };
>
> typedef struct {
>    enum __gnu_mbind_instance mb_inst; // mandatory. ABI specified
> #ifdef VENDOR_1
>    type1 identifier1; // optional implementation defined
>    type2 identifier2; // optional implementation defined
>    ...
>    typeN identifierN; // optional implementation defined
> #endif
> // Add vendor/implementation as necessary
> } __gnu_mbind_context;
>
> int __gnu_mbind_setup(__gnu_mbind_context*);
>
> Now, for MCDRAM instance, from my understanding:
>
> typedef struct {
>    enum __gnu_mbind_instance mb_inst;
> #ifdef VENDOR_MCDRAM
>    unsigned int type;
>    void *addr;
>    size_t length;
> #endif
> } __gnu_mbind_context;
>
> int
> __gnu_mbind_setup (__gnu_mbind_context* mbind_context)
> {
>    /* Even if a real implementation exists, check and allow disabling
>       special memory bindings */
>    if (mbind_context->mb_inst == GNU_MBIND_DEFAULT) {
>       return 0;
>    }
>    else {
>       // return real_implementation_result;
>    }
> }
>
> If that is not what you are saying, I want to understand how to use it
> in a vendor-specific way.

__gnu_mbind_setup is called from ld.so.  Since there is only one ld.so,
it needs to know what to pass to __gnu_mbind_setup.  Not all arguments
have to be used by all implementations nor all memory types.

>
>>> I would like to handle NVM/NVMe (long ago I had mentioned
>>> PT_PERSISTENT) through this MBIND, and my implementation of handling
>>> NVM/NVMe needs more data to be passed to such "setup" functions.
>>
>>
>> I call it MBIND since an MBIND segment is inside a LOAD segment and
>> my real __gnu_mbind_setup will call mbind to move a MBIND region to
>> a NUMA node after it has been loaded and relocated. We can give it
>> a different name if you have a better one.
>
>
> I don't have any comment on this MBIND naming. It sounds good.
>
> --
> Supra



-- 
H.J.

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00               ` Suprateeka R Hegde
@ 2017-01-01  0:00                 ` H.J. Lu
  2017-01-01  0:00                   ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC
  To: Suprateeka R Hegde; +Cc: Carlos O'Donell, gnu-gabi

[-- Attachment #1: Type: text/plain, Size: 6155 bytes --]

On Thu, Mar 9, 2017 at 7:23 AM, Suprateeka R Hegde
<hegdesmailbox@gmail.com> wrote:
> H.J,
>
> I think we are full 180 degrees out-of-phase in our discussion this time
> somehow :-)
>
> As I have already asked, I want to know what is that ONE-FIXED-FORM of
> __gnu_mbind_setup being called by ld.so.
>
> The code you provided seems to be Intel's implementation of libmbind. I
> am interested in what it looks like in ld.so, because that is what we want
> to document in the ABI support. We do not want implementation-specific
> details in GNU-gABI.
>
> So inside ld.so, would it be what I showed in my earlier mail or would it be
> something else?
>
> In my opinion, we have to bring that out in the ABI support proposal.
> Without the actual signature/prototype, __gnu_mbind_setup sounds more like a
> guideline and less like an ABI spec/standard. And in actual code (in ld.so),
> it may eventually appear really different for each vendor/implementation.
>
> So, either keep it as a guideline or make it generic. IMHO, we cannot keep
> the following (original text) as generic:
>
> ---
>>
>> Run-time support
>>
>> int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);
>
> ---
>
> --
> Supra
>
>
>
> On 07-Mar-2017 04:05 AM, H.J. Lu wrote:
>>
>> On Mon, Mar 6, 2017 at 5:25 AM, Suprateeka R Hegde
>> <hegdesmailbox@gmail.com> wrote:
>>>
>>> On 04-Mar-2017 07:37 AM, Carlos O'Donell wrote:
>>>>
>>>>
>>>> On 03/03/2017 11:00 AM, H.J. Lu wrote:
>>>>>
>>>>>
>>>>> __gnu_mbind_setup is called from ld.so.  Since there is only one ld.so,
>>>>> it needs to know what to pass to __gnu_mbind_setup.  Not all arguments
>>>>> have to be used by all implementations nor all memory types.
>>>>
>>>>
>>>>
>>>> I think what Supra is suggesting is a pointer-to-implementation
>>>> interface which would allow ld.so to pass completely different
>>>> arguments to the library depending on what kind of memory is being
>>>> defined by the sh_info value. It avoids needing to encode all the types
>>>> in the API, and just uses an incomplete pointer to the type.
>>>
>>>
>>>
>>> That's absolutely right.
>>>
>>> However, I am not suggesting one is better than the other. I just want
>>> to get clarity on what the code looks like for different implementations.
>>>
>>> On 03-Mar-2017 09:30 PM, H.J. Lu wrote:
>>>>
>>>>
>>>> __gnu_mbind_setup is called from ld.so.  Since there is only one ld.so,
>>>> it needs to know what to pass to __gnu_mbind_setup.
>>>
>>>
>>>
>>> So I want to know what is that ONE-FIXED-FORM of __gnu_mbind_setup being
>>> called by ld.so.
>>>
>>>>  Not all arguments
>>>> have to be used by all implementations nor all memory types.
>>>
>>>
>>>
>>> I think I am still not getting this. Really sorry for that. Would it be
>>> possible for you to write a small piece of pseudo code that depicts what
>>> this design looks like for different implementations?
>>>
>>
>> For my usage, I only want to know memory type, address and its size:
>>
>> #define _GNU_SOURCE
>> #include <unistd.h>
>> #include <errno.h>
>> #include <stdint.h>
>> #include <cpuid.h>
>> #include <numa.h>
>> #include <numaif.h>
>> #include <mbind.h>
>>
>> #ifdef LIBMBIND_DEBUG
>> #include <stdio.h>
>> #endif
>>
>> /* High-Bandwidth Memory node mask.  */
>> static struct bitmask *hbw_node_mask;
>>
>> /* Initialize High-Bandwidth Memory node mask.  This must be called before
>>    __gnu_mbind_setup.  */
>> static void
>> __attribute__ ((used, constructor))
>> init_node_mask (void)
>> {
>>   if (__get_cpuid_max (0, 0) == 0)
>>     return;
>>
>>   /* Check if vendor is Intel.  */
>>   uint32_t eax, ebx, ecx, edx;
>>   __cpuid (0, eax, ebx, ecx, edx);
>>   if (!(ebx == 0x756e6547 && ecx == 0x6c65746e && edx == 0x49656e69))
>>     return;
>>
>>   /* Get family and model.  */
>>   uint32_t model;
>>   uint32_t family;
>>   __cpuid (1, eax, ebx, ecx, edx);
>>   family = (eax >> 8) & 0x0f;
>>   if (family != 0x6)
>>     return;
>>   model = (eax >> 4) & 0x0f;
>>   model += (eax >> 12) & 0xf0;
>>
>>   /* Check for KNL and KNM.  */
>>   switch (model)
>>     {
>>     default:
>>       return;
>>
>>     case 0x57: /* Knights Landing.  */
>>     case 0x85: /* Knights Mill.  */
>>       break;
>>     }
>>
>>   /* Check if NUMA configuration is supported.  */
>>   int nodes_num = numa_num_configured_nodes ();
>>   if (nodes_num < 2)
>>     return;
>>
>>   /* Get MCDRAM NUMA nodes.  */
>>   struct bitmask *node_mask = numa_allocate_nodemask ();
>>   struct bitmask *node_cpu = numa_allocate_cpumask ();
>>
>>   int i;
>>   for (i = 0; i < nodes_num; i++)
>>     {
>>       numa_node_to_cpus (i, node_cpu);
>>       /* NUMA node without CPU is MCDRAM node.  */
>>       if (numa_bitmask_weight (node_cpu) == 0)
>>         numa_bitmask_setbit (node_mask, i);
>>     }
>>
>>   if (numa_bitmask_weight (node_mask) != 0)
>>     {
>>       /* On Knights Landing and Knights Mill, MCDRAM is High-Bandwidth
>>          Memory.  */
>>       hbw_node_mask = node_mask;
>>     }
>>   else
>>     numa_bitmask_free (node_mask);
>>   numa_bitmask_free (node_cpu);
>> }
>>
>> /* Support all different memory types.  */
>>
>> static int
>> mbind_setup (unsigned int type, void *addr, size_t length,
>>              unsigned int mode, unsigned int flags)
>> {
>>   int err = ENXIO;
>>
>>   switch (type)
>>     {
>>     default:
>> #ifdef LIBMBIND_DEBUG
>>       printf ("Unsupported mbind type %u: from %p of size %zu\n",
>>               type, addr, length);
>> #endif
>>       return EINVAL;
>>
>>     case GNU_MBIND_HBW:
>>       if (hbw_node_mask)
>>         err = mbind (addr, length, mode, hbw_node_mask->maskp,
>>                      hbw_node_mask->size, flags);
>>       break;
>>     }
>>
>>   if (err < 0)
>>     err = errno;
>>
>> #ifdef LIBMBIND_DEBUG
>>   printf ("Mbind type %u: from %p of size %zu\n", type, addr, length);
>> #endif
>>
>>   return err;
>> }
>>
>> int
>> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
>> {
>>   return mbind_setup (type, addr, length, MPOL_BIND, MPOL_MF_MOVE);
>> }
>>
>> If other memory types need additional information, they can be
>> passed to __gnu_mbind_setup.  We just need to know what
>> information is needed.
>>
>>
>

Here is my glibc prototype.

-- 
H.J.

[-- Attachment #2: glibc-mbind.patch --]
[-- Type: text/x-patch, Size: 8431 bytes --]

diff --git a/csu/init-first.c b/csu/init-first.c
index 099e7bc..c7b8f1f 100644
--- a/csu/init-first.c
+++ b/csu/init-first.c
@@ -75,6 +75,10 @@ _init (int argc, char **argv, char **envp)
   /* First the initialization which normally would be done by the
      dynamic linker.  */
   _dl_non_dynamic_init ();
+
+# ifdef INIT_MBIND
+  INIT_MBIND (argv[0], _dl_phdr, _dl_phnum, 0);
+# endif
 #endif
 
 #ifdef VDSO_SETUP
diff --git a/elf/dl-init.c b/elf/dl-init.c
index 5c5f3de..7bd6af6 100644
--- a/elf/dl-init.c
+++ b/elf/dl-init.c
@@ -35,6 +35,14 @@ call_init (struct link_map *l, int argc, char **argv, char **env)
      dependency.  */
   l->l_init_called = 1;
 
+#ifdef INIT_MBIND
+  if (l->l_phdr)
+    {
+      const char *name = l->l_name[0] == '\0' ? argv[0] : l->l_name;
+      INIT_MBIND (name, l->l_phdr, l->l_phnum, l->l_addr);
+    }
+#endif
+
   /* Check for object which constructors we do not run here.  */
   if (__builtin_expect (l->l_name[0], 'a') == '\0'
       && l->l_type == lt_executable)
diff --git a/elf/dl-support.c b/elf/dl-support.c
index 3c46a7a..aa240c4 100644
--- a/elf/dl-support.c
+++ b/elf/dl-support.c
@@ -385,3 +385,7 @@ _dl_non_dynamic_init (void)
 #ifdef DL_SYSINFO_IMPLEMENTATION
 DL_SYSINFO_IMPLEMENTATION
 #endif
+
+#ifdef INIT_MBIND
+# include <setup-mbind.c>
+#endif
diff --git a/elf/elf.h b/elf/elf.h
index 6d3b356..a743cda 100644
--- a/elf/elf.h
+++ b/elf/elf.h
@@ -728,6 +728,11 @@ typedef struct
 #define PT_LOPROC	0x70000000	/* Start of processor-specific */
 #define PT_HIPROC	0x7fffffff	/* End of processor-specific */
 
+/* GNU mbind segments */
+#define PT_GNU_MBIND_NUM	4096
+#define PT_GNU_MBIND_LO		0x6474e555
+#define PT_GNU_MBIND_HI		(PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)
+
 /* Legal values for p_flags (segment flags).  */
 
 #define PF_X		(1 << 0)	/* Segment is executable */
diff --git a/sysdeps/unix/sysv/linux/x86/ldsodefs.h b/sysdeps/unix/sysv/linux/x86/ldsodefs.h
new file mode 100644
index 0000000..1b1c1f8
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/x86/ldsodefs.h
@@ -0,0 +1,26 @@
+/* Run-time dynamic linker data structures for loaded ELF shared objects.  x86
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef	_LDSODEFS_H
+
+/* Get the real definitions.  */
+#include_next <ldsodefs.h>
+
+#include <init-mbind.h>
+
+#endif /* ldsodefs.h */
diff --git a/sysdeps/x86/Makefile b/sysdeps/x86/Makefile
index 0d0326c2..7887a3c 100644
--- a/sysdeps/x86/Makefile
+++ b/sysdeps/x86/Makefile
@@ -3,7 +3,7 @@ gen-as-const-headers += cpu-features-offsets.sym
 endif
 
 ifeq ($(subdir),elf)
-sysdep-dl-routines += dl-get-cpu-features
+sysdep-dl-routines += dl-get-cpu-features setup-mbind
 
 tests += tst-get-cpu-features
 tests-static += tst-get-cpu-features-static
diff --git a/sysdeps/x86/Versions b/sysdeps/x86/Versions
index e029237..a627762 100644
--- a/sysdeps/x86/Versions
+++ b/sysdeps/x86/Versions
@@ -1,5 +1,8 @@
 ld {
   GLIBC_PRIVATE {
     __get_cpu_features;
+
+    # Set up special memory.
+    __gnu_mbind_setup;
   }
 }
diff --git a/sysdeps/x86/init-mbind.h b/sysdeps/x86/init-mbind.h
new file mode 100644
index 0000000..b881fdf
--- /dev/null
+++ b/sysdeps/x86/init-mbind.h
@@ -0,0 +1,46 @@
+/* This file is part of the GNU C Library.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <unistd.h>
+#include <libintl.h>
+#include <setup-mbind.h>
+
+static inline void
+init_mbind (const char *filename, const ElfW(Phdr) *phdr, size_t phnum,
+	    ElfW(Addr) load)
+{
+  ElfW(Addr) pagesize = GLRO(dl_pagesize);
+  for (; phnum; phnum--, phdr++)
+    if (phdr->p_type >= PT_GNU_MBIND_LO
+	&& phdr->p_type <= PT_GNU_MBIND_HI)
+      {
+	ElfW(Addr) start = phdr->p_vaddr;
+	if (pagesize > phdr->p_align
+	    || (start & (pagesize - 1)) != 0)
+	  _dl_fatal_printf (N_("%s: invalid PT_GNU_MBIND segment\n"),
+			    filename);
+
+	int error_code = __gnu_mbind_setup (phdr->p_type - PT_GNU_MBIND_LO,
+					    (void *) (load + start),
+					    phdr->p_memsz);
+	if (error_code < 0)
+	  _dl_fatal_printf (N_("__gnu_mbind_setup failed on file %s: error 0x%x\n"),
+			    filename, -error_code);
+      }
+}
+
+#define INIT_MBIND init_mbind
diff --git a/sysdeps/x86/setup-mbind.c b/sysdeps/x86/setup-mbind.c
new file mode 100644
index 0000000..d235b2e
--- /dev/null
+++ b/sysdeps/x86/setup-mbind.c
@@ -0,0 +1,27 @@
+/* This file is part of the GNU C Library.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#include <setup-mbind.h>
+
+int weak_function
+__gnu_mbind_setup (unsigned int type __attribute__ ((unused)),
+		   void *addr __attribute__ ((unused)),
+		   size_t length __attribute__ ((unused)))
+{
+  return 0;
+}
diff --git a/sysdeps/x86/setup-mbind.h b/sysdeps/x86/setup-mbind.h
new file mode 100644
index 0000000..f26972f
--- /dev/null
+++ b/sysdeps/x86/setup-mbind.h
@@ -0,0 +1,21 @@
+/* This file is part of the GNU C Library.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stddef.h>
+
+/* Set up special memory.  */
+extern int weak_function __gnu_mbind_setup (unsigned int, void *, size_t);
diff --git a/sysdeps/x86_64/localplt.data b/sysdeps/x86_64/localplt.data
index 014a9f4..8f4d47c 100644
--- a/sysdeps/x86_64/localplt.data
+++ b/sysdeps/x86_64/localplt.data
@@ -11,6 +11,7 @@ libc.so: realloc + RELA R_X86_64_GLOB_DAT
 libm.so: matherr
 # The main malloc is interposed into the dynamic linker, for
 # allocations after the initial link (when dlopen is used).
+ld.so: __gnu_mbind_setup + RELA R_X86_64_GLOB_DAT
 ld.so: malloc + RELA R_X86_64_GLOB_DAT
 ld.so: calloc + RELA R_X86_64_GLOB_DAT
 ld.so: realloc + RELA R_X86_64_GLOB_DAT

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                             ` Florian Weimer
@ 2017-01-01  0:00                               ` H.J. Lu
  0 siblings, 0 replies; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC
  To: Florian Weimer; +Cc: Suprateeka R Hegde, Carlos O'Donell, gnu-gabi

On Thu, Mar 30, 2017 at 12:37 PM, Florian Weimer <fw@deneb.enyo.de> wrote:
>
> I see that you are now considering dl_iterate_phdr, which gives you
> access to the program headers, from where you can walk the section
> headers.  This looks like a good solution.
>
>>> I expect that libraries such as bdwgc might want to use the
>>> __gnu_mbind_setup callback as well, just to register freshly loaded shared
>>
>> Did you mean to mark pieces of memory garbage collectible? I guess it may
>> work.
>
> It's more for identifying roots to scan.
>
>>> objects and their data sections.  Can we make this work for multiple users?
>>>
>>
>> What did you mean by "multiple users"?
>
> Different libraries installing different hooks with similar
> intentions.

Yes, my dl_iterate_phdr approach works with multiple libraries.  Each of
them can call dl_iterate_phdr to process relevant segments.
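To illustrate the multiple-users case, here is a sketch of a second, independent client (say, a collector looking for roots) that visits every loaded module rather than only its own. The struct and function names are hypothetical; returning 0 from the callback is what makes dl_iterate_phdr continue to the next module.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>

#ifndef PT_GNU_MBIND_NUM
# define PT_GNU_MBIND_NUM 4096
#endif
#ifndef PT_GNU_MBIND_LO
# define PT_GNU_MBIND_LO (PT_LOOS + 0x474e555)
#endif
#ifndef PT_GNU_MBIND_HI
# define PT_GNU_MBIND_HI (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)
#endif

/* Hypothetical summary gathered by one client library.  */
struct mbind_scan
{
  size_t modules;    /* Loaded modules visited.  */
  size_t segments;   /* GNU_MBIND segments found across all of them.  */
};

static int
scan_module (struct dl_phdr_info *info, size_t size, void *data)
{
  (void) size;
  struct mbind_scan *scan = data;
  scan->modules++;
  for (size_t i = 0; i < info->dlpi_phnum; i++)
    {
      ElfW(Word) type = info->dlpi_phdr[i].p_type;
      if (type >= PT_GNU_MBIND_LO && type <= PT_GNU_MBIND_HI)
        scan->segments++;
    }
  return 0;          /* Zero: continue with the next module.  */
}

/* Each library runs its own walk with its own state; nothing is
   shared with other callers of dl_iterate_phdr.  */
static struct mbind_scan
scan_loaded_modules (void)
{
  struct mbind_scan scan = { 0, 0 };
  dl_iterate_phdr (scan_module, &scan);
  return scan;
}
```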

-- 
H.J.

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                                     ` Suprateeka R Hegde
@ 2017-01-01  0:00                                       ` H.J. Lu
  0 siblings, 0 replies; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC
  To: Suprateeka R Hegde; +Cc: Carlos O'Donell, gnu-gabi

On Fri, Mar 31, 2017 at 6:44 AM, Suprateeka R Hegde
<hegdesmailbox@gmail.com> wrote:
> On 30-Mar-2017 10:10 PM, H.J. Lu wrote:
>>> However, I am just thinking that your earlier approach --
>>> __gnu_mbind_setup -- is better when shared libraries with GNU_MBIND
>>> segments are dlopen'ed. They don't have to iterate all over again to
>>> reach their PHDR. Or what is the recommendation for such dlopen'ed
>>> libraries?
>>
>> It is true that dl_iterate_phdr is called by every shared object, dlopened or
>> not, to locate its own PHDR.
>
> Let's put in a one-liner as a best practice or guideline of sorts. You
> have already made it clear in the example code. I am just thinking of
> putting it in words too.
>
> Let's say something like: each load module is expected to process only
> its own special memory segments, meaning that shlibs/exe need not do any
> book-keeping to avoid multiple executions of the special memory setup
> for the same load module.
>
>>> And since dl_iterate_phdr(3) is not part of any standard, it may change
>>> in a totally incompatible way in the future.
>>>
>>
>> dl_iterate_phdr isn't in any standard.  But it is in glibc.  Given that my
>> proposal is a GNU extension, it isn't a major issue.  Working with
>> existing glibc is a big plus.
>
> Awesome. Looks great. Thanks a lot for the new approach.
>
> --
> Supra

Here is the updated proposal.

Thanks.


-- 
H.J.
--
ABI support for special memory area

To section attributes, add

#define SHF_GNU_MBIND     0x00100000

for sections used to place data or text into a special memory area.
The section names should start with ".mbind" so that they won't be
grouped together with normal sections by the link editor.  The sh_info
field indicates the special memory type.  SHF_GNU_MBIND is only
applicable to SHF_ALLOC sections.
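As an illustration, a tool might check a section header against these rules roughly as follows. This is only a sketch over the 64-bit <elf.h> types; the flag value is taken from this proposal and given a local, hypothetical name (PROPOSED_SHF_GNU_MBIND) so it cannot clash with any system definition.

```c
#include <elf.h>
#include <stdbool.h>
#include <string.h>

/* Flag value from this proposal, under a local hypothetical name.  */
#define PROPOSED_SHF_GNU_MBIND 0x00100000

/* Returns true if a section header matches the proposed rules for a
   special memory section: the flag is set, the section is also
   SHF_ALLOC, and the name uses the recommended ".mbind" prefix.  */
static bool
is_mbind_section (const char *name, const Elf64_Shdr *shdr)
{
  if ((shdr->sh_flags & PROPOSED_SHF_GNU_MBIND) == 0)
    return false;
  if ((shdr->sh_flags & SHF_ALLOC) == 0)
    return false;              /* Only applicable to SHF_ALLOC sections.  */
  return strncmp (name, ".mbind", strlen (".mbind")) == 0;
}

/* For such a section, sh_info holds the special memory type.  */
static unsigned int
mbind_section_type (const Elf64_Shdr *shdr)
{
  return shdr->sh_info;
}
```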

The following memory types in the sh_info field are defined:

/* The highest bandwidth memory.   */
#define GNU_MBIND_HBW       0

To the "Program Header" section, add an inclusive range of segment types
for GNU_MBIND segments:

#define PT_GNU_MBIND_NUM    4096
#define PT_GNU_MBIND_LO     (PT_LOOS + 0x474e555)
#define PT_GNU_MBIND_HI     (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)

The array element specifies the location and size of a special memory area.
Each GNU_MBIND segment contains one GNU_MBIND section and the segment
type is PT_GNU_MBIND_LO plus the sh_info value.  If the sh_info value is
greater than or equal to PT_GNU_MBIND_NUM, no GNU_MBIND segment will be
created.  Each GNU_MBIND segment must be aligned at a page boundary.  The
interpretation of the special memory area information is
implementation-dependent.  Implementations may ignore GNU_MBIND segments.
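A small sketch of the mapping between sh_info and segment type described here, plus the page-alignment rule. The helper names are hypothetical; the constants are the ones defined above. An sh_info of PT_GNU_MBIND_NUM or larger would produce a type past PT_GNU_MBIND_HI, so the sketch creates no segment in that case.

```c
#include <elf.h>
#include <stdint.h>

#ifndef PT_GNU_MBIND_NUM
# define PT_GNU_MBIND_NUM 4096
#endif
#ifndef PT_GNU_MBIND_LO
# define PT_GNU_MBIND_LO (PT_LOOS + 0x474e555)
#endif
#ifndef PT_GNU_MBIND_HI
# define PT_GNU_MBIND_HI (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)
#endif

/* Link-editor side: segment type for a section whose sh_info is the
   memory type.  Returns PT_NULL when no GNU_MBIND segment is created.  */
static uint32_t
mbind_segment_type (uint32_t sh_info)
{
  if (sh_info >= PT_GNU_MBIND_NUM)
    return PT_NULL;
  return PT_GNU_MBIND_LO + sh_info;
}

/* Run-time side: recover the memory type from p_type, or -1 if the
   segment is not a GNU_MBIND segment.  */
static int64_t
mbind_memory_type (uint32_t p_type)
{
  if (p_type < PT_GNU_MBIND_LO || p_type > PT_GNU_MBIND_HI)
    return -1;
  return (int64_t) (p_type - PT_GNU_MBIND_LO);
}

/* Each GNU_MBIND segment must start on a page boundary; pagesize is
   assumed to be a power of two.  */
static int
mbind_segment_aligned (uint64_t p_vaddr, uint64_t pagesize)
{
  return (p_vaddr & (pagesize - 1)) == 0;
}
```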

Run-time support

Each load module is expected to process only its own special memory
segments.  There is no need for executables and shared objects to do any
book-keeping to avoid multiple executions of the special memory setup for
the same load module.

dl_iterate_phdr in the GNU C library:

int dl_iterate_phdr (int (*callback) (struct dl_phdr_info *info,
                                      size_t size, void *data),
                     void *data);

is called via the .init_array section to process GNU_MBIND segments in
executable and shared objects:

static int
callback (struct dl_phdr_info *info, size_t size, void *data)
{
  Compute the load address of the current module.
  if info->dlpi_addr == the load address of the current module
    {
      check ELF program headers and process GNU_MBIND segments
      return 1;
    }

  return 0;
}

static void
call_gnu_mbind_setup (void)
{
  dl_iterate_phdr (callback, NULL);
}

static void (*init_array) (void)
 __attribute__ ((section (".init_array"), used))
 = &call_gnu_mbind_setup;
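The callback above is pseudocode. One way to fill it in, offered only as a sketch: identify the current module as the one whose PT_LOAD segments contain the address of a module-local object, then walk its program headers for the GNU_MBIND range. setup_mbind_segment, module_anchor and the two counters are hypothetical placeholders, and the sketch assumes a GNU toolchain (for the section attribute) and glibc (for dl_iterate_phdr).

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>
#include <stdint.h>

#ifndef PT_GNU_MBIND_NUM
# define PT_GNU_MBIND_NUM 4096
#endif
#ifndef PT_GNU_MBIND_LO
# define PT_GNU_MBIND_LO (PT_LOOS + 0x474e555)
#endif
#ifndef PT_GNU_MBIND_HI
# define PT_GNU_MBIND_HI (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)
#endif

static char module_anchor;        /* Its address identifies this module.  */
static int mbind_module_found;    /* Kept so the behaviour can be checked.  */
static int mbind_segments_seen;

/* Hypothetical per-segment hook.  A real one would bind
   [addr, addr + length) to memory of the given type, e.g. with mbind(2).  */
static void
setup_mbind_segment (unsigned int type, void *addr, size_t length)
{
  (void) type;
  (void) addr;
  (void) length;
  mbind_segments_seen++;
}

/* True if one of the module's PT_LOAD segments contains module_anchor.  */
static int
is_current_module (const struct dl_phdr_info *info)
{
  uintptr_t self = (uintptr_t) &module_anchor;
  for (size_t i = 0; i < info->dlpi_phnum; i++)
    {
      const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
      uintptr_t start = info->dlpi_addr + ph->p_vaddr;
      if (ph->p_type == PT_LOAD
          && self >= start && self - start < ph->p_memsz)
        return 1;
    }
  return 0;
}

static int
callback (struct dl_phdr_info *info, size_t size, void *data)
{
  (void) size;
  (void) data;
  if (!is_current_module (info))
    return 0;                   /* Not us: keep iterating.  */

  for (size_t i = 0; i < info->dlpi_phnum; i++)
    {
      const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
      if (ph->p_type >= PT_GNU_MBIND_LO && ph->p_type <= PT_GNU_MBIND_HI)
        setup_mbind_segment (ph->p_type - PT_GNU_MBIND_LO,
                             (void *) (info->dlpi_addr + ph->p_vaddr),
                             ph->p_memsz);
    }
  return 1;                     /* Found this module: stop iterating.  */
}

static void
call_gnu_mbind_setup (void)
{
  /* dl_iterate_phdr returns the last value returned by the callback.  */
  mbind_module_found = dl_iterate_phdr (callback, NULL);
}

static void (*init_array) (void)
  __attribute__ ((section (".init_array"), used))
  = &call_gnu_mbind_setup;
```

Testing address containment in a PT_LOAD range, rather than computing the load address directly, keeps the check the same for non-PIE executables, PIEs and shared objects, dlopen'ed or not.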

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                       ` H.J. Lu
@ 2017-01-01  0:00                         ` Suprateeka R Hegde
  2017-01-01  0:00                           ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: Suprateeka R Hegde @ 2017-01-01  0:00 UTC
  To: H.J. Lu; +Cc: Carlos O'Donell, gnu-gabi

On Friday 17 March 2017 02:55 AM, H.J. Lu wrote:
>> Since ld.so is not meant only for programs with C style linkage, what if
>> the real implementation library is written in C++ and wants to export
>> only mangled names (interfaces) without any "extern C" kludge? Or is
>> this considered to be a standard C library call just like mmap etc.?
> 
> Only the __gnu_mbind_setup symbol is used.  We can change the
> second argument to "void *data" and make it dependent on memory
> type.  But to support a new memory type, we have to update ld.so.  I'd
> like to use the same ld.so binary to support any memory types even if
> it means that we need to pass info to __gnu_mbind_setup which isn't
> used by all memory types.

Ah! Now I understand the design completely (I think). Looks like Carlos
understood this quite earlier in the discussion.

You are saying that the interface -

int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);

- is fixed in ld.so and also in the real implementation library. And,
the real implementation in turn calls the actual-real-implementation, as
shown in your libmbind code:

int
__gnu_mbind_setup (unsigned int type, void *addr, size_t length)
{
  // in turn calls actual implementation
  return vendor_specific_mbind_setup (vendor specific types);
}


All this while, based on the current description, I was of the
impression that your design allows __gnu_mbind_setup interface itself to
be overridden in the real implementation, something like:

int
__gnu_mbind_setup (__nvm_kmem_t *nvm_obj, void *nvm_handle)
{
  // actual implementation directly here in the body
}

So I was wondering how and hence most of my points were out-of-phase.

>  The question is what the possible info needed
> for all memory types is.

That's too much to predict right now. And the current interface you
defined also does not seem to be generic. For instance, my NVM
implementation, though not complete, needs a totally different set of
arguments. So going by the current design, I will have to use
__gnu_mbind_setup (unsigned int type, void *addr, size_t length) just to
call my real setup, without using any of the arguments passed by ld.so.

Assuming I am in sync with you now, I would say that the pseudo code I
showed earlier works for you, for me, and for anybody else. In other
words, it is more generic.

With that approach, there is

1. No need to update ld.so every time for every new mem type
2. No need to know all possible info needed for all mem types
3. No need to encode all types in the API (as Carlos said)

We just use a pointer-to-implementation interface: the struct
__gnu_mbind_context that I showed. We can have a default struct
instantiated in ld.so and a global pointer pointing to it. Later, the
global pointer can be made to point to the vendor-specific struct before
ld.so actually calls __gnu_mbind_setup, thereby completing a successful
override (only when necessary, that is, when special memory types are
in use).

Or any similar mechanism to override the default struct instantiated in
ld.so; there are many well-known ways to do that.
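A minimal C sketch of the pointer-to-implementation scheme described above. Everything here is hypothetical: the thread never fixed the layout of struct __gnu_mbind_context, and the names __gnu_mbind_current, nvm_setup and nvm_install_context are made up for illustration. How ld.so would hand over each segment's type, address and length is also left open, as it is in the discussion.

```c
#include <stddef.h>

/* Hypothetical layout; the thread never pinned one down.  */
struct __gnu_mbind_context
{
  /* Implementation hook, always called with this context.  */
  int (*setup) (struct __gnu_mbind_context *ctx);
  /* Vendor-specific data, opaque to ld.so.  */
  void *impl;
};

static int
default_setup (struct __gnu_mbind_context *ctx)
{
  (void) ctx;
  return 0;                     /* Ordinary memory: nothing to do.  */
}

/* Default instance in ld.so and the global pointer that refers to it.  */
static struct __gnu_mbind_context default_context = { default_setup, NULL };
struct __gnu_mbind_context *__gnu_mbind_current = &default_context;

/* The one fixed form ld.so would call: always a single pointer.  */
int
__gnu_mbind_setup (struct __gnu_mbind_context *ctx)
{
  return ctx->setup (ctx);
}

/* Hypothetical vendor side, e.g. an NVM library.  */
struct nvm_data
{
  int initialized;
};

static int
nvm_setup (struct __gnu_mbind_context *ctx)
{
  struct nvm_data *nvm = ctx->impl;
  nvm->initialized = 1;         /* Real NVM setup would go here.  */
  return 0;
}

static struct nvm_data nvm_state;
static struct __gnu_mbind_context nvm_context = { nvm_setup, &nvm_state };

/* Run by the vendor library before ld.so invokes __gnu_mbind_setup.  */
void
nvm_install_context (void)
{
  __gnu_mbind_current = &nvm_context;
}
```

The point of the design is that ld.so only ever passes one pointer, so a new memory type needs no ld.so change; the cost is that the struct layout (here, a hook plus an opaque pointer) becomes the ABI instead.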

Personally I think this would be a better way to provide the ABI support
in a generic way.

That said, I am OK living with minor kludges, and we can keep the design
as is.

> 
>> And you may also want to define the flow for fully archive bound static
>> binaries.
> 
> For static executable, __gnu_mbind_setup will be called on all MBIND
> segments before constructors are called.  __gnu_mbind_setup in libc.a
> is weak and will be overridden by the real one in libmbind.a.

Let's add this to the ABI support document as well.
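For illustration, the weak default for the static case might look like this in libc.a (a sketch of the idea, not glibc's actual code):

```c
#include <stddef.h>

/* libc.a side: a do-nothing default marked weak, so that a strong
   definition of the same symbol linked into the program takes precedence.  */
__attribute__ ((weak)) int
__gnu_mbind_setup (unsigned int type, void *addr, size_t length)
{
  (void) type;
  (void) addr;
  (void) length;
  return 0;                     /* Success: nothing to set up.  */
}
```

One caveat worth verifying before documenting this: GNU ld extracts an archive member only to resolve a still-undefined symbol, and a weak definition already counts as defined, so the strong __gnu_mbind_setup in libmbind.a may only take effect if its archive member is actually pulled into the link (for example, because the program references another symbol from it).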

--
Supra

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00       ` H.J. Lu
@ 2017-01-01  0:00         ` Florian Weimer
  2017-01-01  0:00           ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: Florian Weimer @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu, Carlos O'Donell; +Cc: gnu-gabi

On 02/28/2017 06:03 PM, H.J. Lu wrote:
> On Tue, Feb 28, 2017 at 8:19 AM, Carlos O'Donell <carlos@redhat.com> wrote:
>> On 02/23/2017 09:59 PM, H.J. Lu wrote:
>>>> Why does it run _after_ all shared objects and the executable file are loaded?
>>>
>>> Since __gnu_mbind_setup may call any external functions, it can only
>>> be done after everything is loaded and relocated.
>>
>> Who defines this function?
>
> Platform vendor with special memory support should provide such function.
>
>> Where is it implemented?
>
> We are working on libmbind to implement it.

That's backwards.  Either we need to merge libmbind into glibc, or this
should be something provided by the kernel vDSO.

We certainly don't want to repeat the mistake with the unwinder and 
libgcc_s.

>> Why can't this be run in a constructor? Is that too late?
>
> We can use MCDRAM for dynamically allocated memory with
> memkind.  We are looking for a user-friendly way to use MCDRAM
> for normal data variables.

Is it really necessary to avoid the pointer indirection?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00         ` Carlos O'Donell
@ 2017-01-01  0:00           ` Suprateeka R Hegde
  2017-01-01  0:00             ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: Suprateeka R Hegde @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Carlos O'Donell, H.J. Lu; +Cc: gnu-gabi

On 04-Mar-2017 07:37 AM, Carlos O'Donell wrote:
> On 03/03/2017 11:00 AM, H.J. Lu wrote:
>> __gnu_mbind_setup is called from ld.so.  Since there is only one ld.so,
>> it needs to know what to pass to __gnu_mbind_setup.  Not all arguments
>> have to be used by all implementations nor all memory types.
>
> I think what Supra is suggesting is a pointer-to-implementation interface
> which would allow ld.so to pass completely different arguments to the
> library depending on what kind of memory is being defined by the sh_info
> value. It avoids needing to encode all the types in the API, and just
> uses an incomplete pointer to the type.

That's absolutely right.

However, I am not suggesting that one is better than the other. I just
want clarity on what the code looks like for different implementations.

On 03-Mar-2017 09:30 PM, H.J. Lu wrote:
> __gnu_mbind_setup is called from ld.so.  Since there is only one ld.so,
> it needs to know what to pass to __gnu_mbind_setup.

So I want to know what the ONE-FIXED-FORM of __gnu_mbind_setup called by
ld.so is.

>  Not all arguments
> have to be used by all implementations nor all memory types.

I think I am still not getting this. Really sorry for that. Would it be
possible for you to write a small piece of pseudo code that depicts what
this design looks like for different implementations?


Thanks a lot

--
Supra

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                 ` H.J. Lu
@ 2017-01-01  0:00                   ` H.J. Lu
  2017-01-01  0:00                     ` Florian Weimer
  2017-01-01  0:00                     ` Suprateeka R Hegde
  0 siblings, 2 replies; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Suprateeka R Hegde; +Cc: Carlos O'Donell, gnu-gabi

Here is my current proposal.


-- 
H.J.
----
ABI support for special memory area

To section attributes, add

#define SHF_GNU_MBIND     0x00100000

for sections used to place data or text into a special memory area.
The section names should start with ".mbind" so that they won't be
grouped together with normal sections by the link editor.  The sh_info
field indicates the special memory type.  SHF_GNU_MBIND is only
applicable to SHF_ALLOC sections.
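A small sketch of how a tool might apply these rules to a 64-bit section header. mbind_section_type is a hypothetical helper; it falls back to the flag value proposed above when <elf.h> does not define SHF_GNU_MBIND, and it treats the ".mbind" prefix as mandatory, which is stricter than the "should" in the text.

```c
#include <elf.h>
#include <string.h>

#ifndef SHF_GNU_MBIND
# define SHF_GNU_MBIND 0x00100000   /* Value from this proposal.  */
#endif

/* Return the special memory type (sh_info) of section SH, whose name is
   NAME, or -1 if SH is not a usable GNU_MBIND section.  */
static long
mbind_section_type (const Elf64_Shdr *sh, const char *name)
{
  if ((sh->sh_flags & SHF_GNU_MBIND) == 0)
    return -1;
  /* SHF_GNU_MBIND is only applicable to SHF_ALLOC sections.  */
  if ((sh->sh_flags & SHF_ALLOC) == 0)
    return -1;
  /* Stricter than the text: insist on the .mbind name prefix.  */
  if (strncmp (name, ".mbind", strlen (".mbind")) != 0)
    return -1;
  return (long) sh->sh_info;
}
```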

To the "Program Header" section, add an inclusive range of segment types
for GNU_MBIND segments:

#define PT_GNU_MBIND_NUM    4096
#define PT_GNU_MBIND_LO     (PT_LOOS + 0x474e555)
#define PT_GNU_MBIND_HI     (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)

The array element specifies the location and size of a special memory area.
Each GNU_MBIND segment contains one GNU_MBIND section, and the segment
type is PT_GNU_MBIND_LO plus the sh_info value.  If the sh_info value is
greater than or equal to PT_GNU_MBIND_NUM, no GNU_MBIND segment will be
created.  Each GNU_MBIND segment must be aligned on a page boundary.  The
interpretation of the special memory area information is
implementation-dependent.  Implementations may ignore GNU_MBIND segments.
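A worked example of the type mapping, as a sketch with hypothetical helper names. Since PT_LOOS is 0x60000000, PT_GNU_MBIND_LO is 0x6474e555 and PT_GNU_MBIND_HI is 0x6474f554, so sh_info values 0 through 4095 map into the range. An sh_info equal to PT_GNU_MBIND_NUM would already land one past PT_GNU_MBIND_HI, which is why the check below rejects sh_info >= PT_GNU_MBIND_NUM.

```c
#include <elf.h>

#ifndef PT_GNU_MBIND_LO
# define PT_GNU_MBIND_NUM    4096
# define PT_GNU_MBIND_LO     (PT_LOOS + 0x474e555)
# define PT_GNU_MBIND_HI     (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)
#endif

/* Link editor side: map a section's sh_info memory type to its segment
   type.  Returns 0 and stores the type in *P_TYPE, or -1 if no GNU_MBIND
   segment can be created because SH_INFO is out of range.  */
static int
mbind_segment_type (Elf64_Word sh_info, Elf64_Word *p_type)
{
  if (sh_info >= PT_GNU_MBIND_NUM)
    return -1;
  *p_type = PT_GNU_MBIND_LO + sh_info;
  return 0;
}

/* Loader side: is P_TYPE inside the inclusive GNU_MBIND range?  */
static int
is_mbind_segment (Elf64_Word p_type)
{
  return p_type >= PT_GNU_MBIND_LO && p_type <= PT_GNU_MBIND_HI;
}
```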

Run-time support

int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);

It sets up a special memory area of type 'type' and length 'length' at
'addr', where 'addr' is a multiple of the page size.  It returns zero on
success, a positive errno value for a non-fatal error, and a negative
errno value for a fatal error.
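The proposal fixes only the sign convention of the return value. A loader-side sketch of how a caller might tell the three cases apart; the enum, the helper name and the decision to log and continue on non-fatal errors are all assumptions:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical classification of a __gnu_mbind_setup result.  */
enum mbind_status { MBIND_OK, MBIND_NONFATAL, MBIND_FATAL };

static enum mbind_status
classify_mbind_result (int rc)
{
  if (rc == 0)
    return MBIND_OK;
  if (rc > 0)
    {
      /* Positive errno: report it; the area stays ordinary memory.  */
      fprintf (stderr, "mbind setup warning: %s\n", strerror (rc));
      return MBIND_NONFATAL;
    }
  /* Negative errno: the caller should treat this as fatal.  */
  fprintf (stderr, "mbind setup failed: %s\n", strerror (-rc));
  return MBIND_FATAL;
}
```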

After all shared objects and the executable file are loaded and their
relocations are processed, the run-time loader calls __gnu_mbind_setup
with the type, address and length of each GNU_MBIND segment in a shared
object or the executable file.  The default implementation of
__gnu_mbind_setup is

int
__gnu_mbind_setup (unsigned int type, void *addr, size_t length)
{
  return 0;
}

which can be overridden by a different implementation at link time.
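A minimal sketch of that per-object loop, assuming 64-bit program headers and a load bias already known to the caller (ld.so keeps it in its link map). process_mbind_segments is a hypothetical name; the proposal specifies neither an invocation order across segments or objects nor what happens after a fatal result, so stopping at the first negative value is a choice made here.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

#ifndef PT_GNU_MBIND_LO
# define PT_GNU_MBIND_NUM 4096
# define PT_GNU_MBIND_LO  (PT_LOOS + 0x474e555)
# define PT_GNU_MBIND_HI  (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)
#endif

/* The default implementation from the proposal.  */
int
__gnu_mbind_setup (unsigned int type, void *addr, size_t length)
{
  (void) type;
  (void) addr;
  (void) length;
  return 0;
}

/* Hypothetical loader step for one relocated object: call
   __gnu_mbind_setup for each GNU_MBIND segment.  LOAD_BIAS is the
   difference between run-time and link-time addresses.  Returns the
   number of segments handled, or the first fatal (negative) result.  */
static int
process_mbind_segments (const Elf64_Phdr *phdr, size_t phnum,
                        uintptr_t load_bias)
{
  int handled = 0;
  for (size_t i = 0; i < phnum; i++)
    {
      Elf64_Word t = phdr[i].p_type;
      if (t < PT_GNU_MBIND_LO || t > PT_GNU_MBIND_HI)
        continue;
      int rc = __gnu_mbind_setup (t - PT_GNU_MBIND_LO,
                                  (void *) (load_bias + phdr[i].p_vaddr),
                                  phdr[i].p_memsz);
      if (rc < 0)
        return rc;              /* Fatal error: stop here.  */
      handled++;                /* Success or non-fatal error.  */
    }
  return handled;
}
```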

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00       ` H.J. Lu
@ 2017-01-01  0:00         ` Carlos O'Donell
  2017-01-01  0:00           ` Suprateeka R Hegde
  0 siblings, 1 reply; 33+ messages in thread
From: Carlos O'Donell @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu, Suprateeka R Hegde; +Cc: gnu-gabi

On 03/03/2017 11:00 AM, H.J. Lu wrote:
> __gnu_mbind_setup is called from ld.so.  Since there is only one ld.so,
> it needs to know what to pass to __gnu_mbind_setup.  Not all arguments
> have to be used by all implementations nor all memory types.

I think what Supra is suggesting is a pointer-to-implementation interface
which would allow ld.so to pass completely different arguments to the 
library depending on what kind of memory is being defined by the sh_info
value. It avoids needing to encode all the types in the API, and just
uses an incomplete pointer to the type.

-- 
Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00 RFC: ABI support for special memory area H.J. Lu
  2017-01-01  0:00 ` Carlos O'Donell
@ 2017-01-01  0:00 ` Suprateeka R Hegde
  2017-01-01  0:00   ` H.J. Lu
  1 sibling, 1 reply; 33+ messages in thread
From: Suprateeka R Hegde @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu, gnu-gabi

On 23-Feb-2017 09:49 PM, H.J. Lu wrote:
>  The default implementation of __gnu_mbind_setup is
>
> int
> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
> {
>   return 0;
> }
>
> which can be overridden by a different implementation at link-time.
>

Since this is a design that allows vendor-specific extension and
implementation, would it be OK to make it more generic?

Instead of three fixed arguments (type, addr, len), how about something
like a pointer to a generic MBIND_CONTEXT struct (say, of a defined type
__gnu_mbind_context), and letting the implementation define the actual
struct?

I would like to handle NVM/NVMe (a long time back I had mentioned
PT_PERSISTENT) through this MBIND, and my implementation of handling
NVM/NVMe needs more data to be passed to such "setup" functions.

Or should __gnu_mbind_setup be considered a very basic/fundamental
function (used just to set up the "memory area"), with
implementations/vendors expected to write wrapper/handler functions to
handle other aspects of the special memory? In that case, the fixed set
of basic arguments looks OK.

IMHO, __gnu_mbind_setup is a very good design; it should be generic
enough rather than a very specific/basic/fundamental piece of run-time
support.

--
Supra

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                           ` H.J. Lu
@ 2017-01-01  0:00                             ` Suprateeka R Hegde
  2017-01-01  0:00                               ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: Suprateeka R Hegde @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Carlos O'Donell, gnu-gabi



On 03/20/17 23:52, H.J. Lu wrote:
> On Fri, Mar 17, 2017 at 11:12 AM, Suprateeka R Hegde
> <hegdesmailbox@gmail.com> wrote:
>> On Friday 17 March 2017 02:55 AM, H.J. Lu wrote:
>>>> Since ld.so is not meant only for programs with C style linkage, what if
>>>> the real implementation library is written in C++ and wants to export
>>>> only mangled names (interfaces) without any "extern C" kludge? Or is
>>>> this considered to be a standard C library call just like mmap etc.?
>>>
>>> Only the __gnu_mbind_setup symbol is used.  We can change the
>>> second argument to "void *data" and make it dependent on memory
>>> type.  But to support a new memory type, we have to update ld.so.  I'd
>>> like to use the same ld.so binary to support any memory types even if
>>> it means that we need to pass info to __gnu_mbind_setup which isn't
>>> used by all memory types.
>>
>> Ah! Now I understand the design completely (I think). Looks like Carlos
>> understood this quite earlier in the discussion.
>>
>> You are saying that the interface -
>>
>> int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);
>>
>> - is fixed in ld.so and also in the real implementation library. And,
>> the real implementation in turn calls the actual-real-implementation, as
>> shown in your libmbind code:
>>
>> int
>> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
>> {
>>   // in turn calls actual implementation
>>   return vendor_specific_mbind_setup (vendor specific types);
>> }
>>
>>
>> All these while, based on the current description, I was of the
>> impression that your design allows __gnu_mbind_setup interface itself to
>> be overridden in the real implementation, something like:
>>
>> int
>> __gnu_mbind_setup (__nvm_kmem_t *nvm_obj, void *nvm_handle)
>> {
>>   // actual implementation directly here in the body
>> }
>>
>> So I was wondering how and hence most of my points were out-of-phase.
>>
>>>  The question is what the possible info needed
>>> for all memory types is.
>>
>> Thats too much to predict right now. And the current interface you
>> defined also does not seem to be generic. For instance, my NVM
>> implementation, though not complete, needs a totally different set of
>> arguments. So going by the current design, I will have to use
>> __gnu_mbind_setup (unsigned int type, void *addr, size_t length) just to
>> call my real setup, without using any of the arguments passed by ld.so.
>>
>> Assuming I am in sync with you now, I would say that the pseudo code I
>> showed earlier works for you as well as for me as well as for anybody
>> else. In other words it is more generic.
>>
>> With that approach, there is
>>
>> 1. No need to update ld.so every time for every new mem type
>> 2. No need to know all possible info needed for all mem types
>> 3. No need to encode all types in the API (as Carlos said)
>>
>> We just use pointer to implementation interface - struct
>> __gnu_mbind_context that I showed. And we can have a default struct
>> instantiated in ld.so and a global pointer pointing to that. And later
>> the global pointer can be made to point to the vendor specific struct,
>> before ld.so actually calls __gnu_mbind_setup, thereby completing a
>> successful override (if necessary, that is when special memory types are
>> in use).
>>
>> Or similar mechanisms to override default struct instantiated in ld.so.
>> There are many well known ways to override the default struct as we all
>> know.
>>
>> Personally I think this would be a better way to provide the ABI support
>> in a generic way.
> 
> ld.so needs to call the real __gnu_mbind_setup implementation
> with the correct argument. 

Yes, and with my example code, ld.so always calls it with the correct
argument. And it's always only one argument: a pointer to a struct, which
by default points to the default struct and, when overridden, to the
implementation-specific struct.

>  We can keep it as is and add a
> new one, __gnu_mbind_setup_v2, if needed.

Hmm :-)

This also looks good. Though whoever adds this _v2 (assuming it's me
right now :-)) gets to ensure all the herculean compatibility hooks for
_v1 are in place. But that's OK, I believe.

> 
>> That said, I am OK to live with minor kludges and we can keep the design
>> as is.
>>
>>>
>>>> And you may also want to define the flow for fully archive bound static
>>>> binaries.
>>>
>>> For static executable, __gnu_mbind_setup will be called on all MBIND
>>> segments before constructors are called.  __gnu_mbind_setup in libc.a
>>> is weak and will be overridden by the real one in libmbind.a.
>>
>> Lets add this also in the ABI support document.
>>
> 
> How about this:
> 
> Run-time support
> 
> int __gnu_mbind_setup_v1 (unsigned int type, void *addr, size_t length);
> 
> It sets up a special memory area of type 'type' and length 'length' at
> 'addr', where 'addr' is a multiple of the page size.  It returns zero on
> success, a positive errno value for a non-fatal error, and a negative
> errno value for a fatal error.
> 
> After all shared objects and the executable file are loaded and their
> relocations are processed, the run-time loader calls
> __gnu_mbind_setup_v1 with the type, address and length of each
> GNU_MBIND segment in a shared object or the executable file.  If
> __gnu_mbind_setup_v1 must be defined in the run-time loader, it should
> be implemented there as a weak function:
> 
> int
> __gnu_mbind_setup_v1 (unsigned int type, void *addr, size_t length)
> {
>   return 0;
> }
> 
> so that the GNU_MBIND run-time library isn't required for normal
> executables or shared objects.  The real implementation of
> __gnu_mbind_setup_v1 should be in the GNU_MBIND run-time library and
> override the weak one in the run-time loader.

Looks good to me.

--
Supra

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                   ` H.J. Lu
@ 2017-01-01  0:00                     ` Florian Weimer
  2017-01-01  0:00                       ` H.J. Lu
  2017-01-01  0:00                     ` Suprateeka R Hegde
  1 sibling, 1 reply; 33+ messages in thread
From: Florian Weimer @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu, Suprateeka R Hegde; +Cc: Carlos O'Donell, gnu-gabi

On 03/15/2017 11:03 PM, H.J. Lu wrote:
> After all shared objects and the executable file are loaded, relocations
> are processed, for each GNU_MBIND segment in a shared object or the
> executable file, run-time loader calls __gnu_mbind_setup with type,
> address and length.  The default implementation of __gnu_mbind_setup is

Is there a specified invocation order for the segments?

Does the call happen immediately after relocations for an object are
processed, or only after relocations for all objects are processed?

If the latter, why can't you use the existing ELF constructor mechanism 
for this?  As far as I understand it, the call to __gnu_mbind_setup 
would just happen before the constructor calls.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                             ` Suprateeka R Hegde
@ 2017-01-01  0:00                               ` H.J. Lu
  2017-01-01  0:00                                 ` Suprateeka R Hegde
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Suprateeka R Hegde; +Cc: Carlos O'Donell, gnu-gabi

On Wed, Mar 22, 2017 at 9:18 AM, Suprateeka R Hegde
<hegdesmailbox@gmail.com> wrote:
>
>
> On 03/20/17 23:52, H.J. Lu wrote:
>> On Fri, Mar 17, 2017 at 11:12 AM, Suprateeka R Hegde
>> <hegdesmailbox@gmail.com> wrote:
>>> On Friday 17 March 2017 02:55 AM, H.J. Lu wrote:
>>>>> Since ld.so is not meant only for programs with C style linkage, what if
>>>>> the real implementation library is written in C++ and wants to export
>>>>> only mangled names (interfaces) without any "extern C" kludge? Or is
>>>>> this considered to be a standard C library call just like mmap etc.?
>>>>
>>>> Only the __gnu_mbind_setup symbol is used.  We can change the
>>>> second argument to "void *data" and make it dependent on memory
>>>> type.  But to support a new memory type, we have to update ld.so.  I'd
>>>> like to use the same ld.so binary to support any memory types even if
>>>> it means that we need to pass info to __gnu_mbind_setup which isn't
>>>> used by all memory types.
>>>
>>> Ah! Now I understand the design completely (I think). Looks like Carlos
>>> understood this quite earlier in the discussion.
>>>
>>> You are saying that the interface -
>>>
>>> int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);
>>>
>>> - is fixed in ld.so and also in the real implementation library. And,
>>> the real implementation in turn calls the actual-real-implementation, as
>>> shown in your libmbind code:
>>>
>>> int
>>> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
>>> {
>>>   // in turn calls actual implementation
>>>   return vendor_specific_mbind_setup (vendor specific types);
>>> }
>>>
>>>
>>> All these while, based on the current description, I was of the
>>> impression that your design allows __gnu_mbind_setup interface itself to
>>> be overridden in the real implementation, something like:
>>>
>>> int
>>> __gnu_mbind_setup (__nvm_kmem_t *nvm_obj, void *nvm_handle)
>>> {
>>>   // actual implementation directly here in the body
>>> }
>>>
>>> So I was wondering how and hence most of my points were out-of-phase.
>>>
>>>>  The question is what the possible info needed
>>>> for all memory types is.
>>>
>>> Thats too much to predict right now. And the current interface you
>>> defined also does not seem to be generic. For instance, my NVM
>>> implementation, though not complete, needs a totally different set of
>>> arguments. So going by the current design, I will have to use
>>> __gnu_mbind_setup (unsigned int type, void *addr, size_t length) just to
>>> call my real setup, without using any of the arguments passed by ld.so.
>>>
>>> Assuming I am in sync with you now, I would say that the pseudo code I
>>> showed earlier works for you as well as for me as well as for anybody
>>> else. In other words it is more generic.
>>>
>>> With that approach, there is
>>>
>>> 1. No need to update ld.so every time for every new mem type
>>> 2. No need to know all possible info needed for all mem types
>>> 3. No need to encode all types in the API (as Carlos said)
>>>
>>> We just use pointer to implementation interface - struct
>>> __gnu_mbind_context that I showed. And we can have a default struct
>>> instantiated in ld.so and a global pointer pointing to that. And later
>>> the global pointer can be made to point to the vendor specific struct,
>>> before ld.so actually calls __gnu_mbind_setup, thereby completing a
>>> successful override (if necessary, that is when special memory types are
>>> in use).
>>>
>>> Or similar mechanisms to override default struct instantiated in ld.so.
>>> There are many well known ways to override the default struct as we all
>>> know.
>>>
>>> Personally I think this would be a better way to provide the ABI support
>>> in a generic way.
>>
>> ld.so needs to call the real __gnu_mbind_setup implementation
>> with the correct argument.
>
> Yes and with my example code, ld.so calls with correct argument always.
> And its always only one argument -- a pointer to struct. By default
> pointing  to default struct, and when overridden pointing to
> implementation specific struct.
>
>>  We can keep it ASIS and add a new
>> new one, __gnu_mbind_setup_v2, if needed.
>
> Hmm :-)
>
> This also looks good. Though whoever adds this _v2 (assuming its me
> right now :-)), gets to ensure all the herculean compatibility hooks for
> _v1 are in place. But thats OK I believe.
>
>>
>>> That said, I am OK to live with minor kludges and we can keep the design
>>> as is.
>>>
>>>>
>>>>> And you may also want to define the flow for fully archive bound static
>>>>> binaries.
>>>>
>>>> For static executable, __gnu_mbind_setup will be called on all MBIND
>>>> segments before constructors are called.  __gnu_mbind_setup in libc.a
>>>> is weak and will be overridden by the real one in libmbind.a.
>>>
>>> Lets add this also in the ABI support document.
>>>
>>
>> How about this:
>>
>> Run-time support
>>
>> int __gnu_mbind_setup_v1 (unsigned int type, void *addr, size_t length);
>>
>> It sets up special memory area of 'type' and 'length' at 'addr' where
>> 'addr' is a multiple of page size.  It returns zero for success, positive
>> value of ERRNO for non-fatal error and negative value of ERRNO for fatal
>> error.
>>
>> After all shared objects and the executable file are loaded, relocations
>> are processed, for each GNU_MBIND segment in a shared object or the
>> executable file, run-time loader calls __gnu_mbind_setup_v1 with type,
>> address and length.  If __gnu_mbind_setup_v1 must be defined in run-time
>> loader, it should be implemented as a weak function:
>>
>> int
>> __gnu_mbind_setup_v1 (unsigned int type, void *addr, size_t length)
>> {
>>   return 0;
>> }
>>
>> in run-time loader so that the GNU_MBIND run-time library isn't required
>> for normal executable nor shared object.  The real implementation of
>> __gnu_mbind_setup_v1 should be in the GNU_MBIND run-time library and
>> overridde the weak one in run-time loader.
>
> Looks good to me.
>
> --
> Supra

There is a way to support GNU_MBIND segments without the glibc changes.
Instead, dl_iterate_phdr

int dl_iterate_phdr (int (*callback) (struct dl_phdr_info *info,
                                      size_t size, void *data),
                     void *data);

is called via the .init_array section to process GNU_MBIND segments in
executable and shared objects:

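#define _GNU_SOURCE
#include <link.h>
#include <stdint.h>
#define PT_GNU_MBIND_LO  (PT_LOOS + 0x474e555)
#define PT_GNU_MBIND_HI  (PT_GNU_MBIND_LO + 4096 - 1)
extern int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);
/* Any object in this module; its address identifies the module.  */
static const char module_marker;

static int
callback (struct dl_phdr_info *info, size_t size, void *data)
{
  uintptr_t self = (uintptr_t) data;
  int i, mine = 0;
  /* Compute whether INFO is the current module: one of its PT_LOAD
     segments must contain the address of module_marker.  */
  for (i = 0; i < info->dlpi_phnum; i++)
    if (info->dlpi_phdr[i].p_type == PT_LOAD
        && self - (info->dlpi_addr + info->dlpi_phdr[i].p_vaddr)
           < info->dlpi_phdr[i].p_memsz)
      mine = 1;
  if (!mine)
    return 0;                   /* Not this module; keep iterating.  */
  /* Check ELF program headers and process GNU_MBIND segments.  */
  for (i = 0; i < info->dlpi_phnum; i++)
    {
      const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
      if (ph->p_type >= PT_GNU_MBIND_LO && ph->p_type <= PT_GNU_MBIND_HI)
        __gnu_mbind_setup (ph->p_type - PT_GNU_MBIND_LO,
                           (void *) (info->dlpi_addr + ph->p_vaddr),
                           ph->p_memsz);
    }
  return 1;                     /* Found it; stop iterating.  */
}

static void
call_gnu_mbind_setup (void)
{
  dl_iterate_phdr (callback, (void *) &module_marker);
}

static void (*init_array) (void)
 __attribute__ ((section (".init_array"), used))
 = &call_gnu_mbind_setup;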


-- 
H.J.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                           ` H.J. Lu
@ 2017-01-01  0:00                             ` Florian Weimer
  2017-01-01  0:00                               ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: Florian Weimer @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Suprateeka R Hegde, Carlos O'Donell, gnu-gabi

* H. J. Lu:

> On Mon, Mar 20, 2017 at 7:57 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> On 03/16/2017 07:22 PM, H.J. Lu wrote:
>>
>>>> If the latter, why can't you use the existing ELF constructor mechanism
>>>> for
>>>> this?  As far as I understand it, the call to __gnu_mbind_setup would
>>>> just
>>>> happen before the constructor calls.
>>>
>>>
>>> That is correct.  The issue is to access the ELF segment header for each
>>> loaded object only once.  There is no good way to get this info from
>>> constructor.
>>
>>
>> I think you can get the data in a pretty straightforward manner using
>> dlinfo.
>
> dlinfo is used to info from application.  I don't see how it can be used
> here.

You can get an opaque handle from an address and feed it into dlinfo.
However, dlinfo only gives you access to the dynamic section, so
unless you put your section markup there as well, it won't help you.

I see that you are now considering dl_iterate_phdr, which gives you
access to the program headers, from where you can walk the section
headers.  This looks like a good solution.

>> I expect that libraries such as bdwgc might want to use the
>> __gnu_mbind_setup callback as well, just to register freshly loaded shared
>
> Did you mean to mark pieces of memory garbage collectible? I guess it may
> work.

It's more for identifying roots to scan.

>> objects and their data sections.  Can we make this work for multiple users?
>>
>
> What did you mean by "multiple users"?

Different libraries installing different hooks with similar
intentions.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: RFC: ABI support for special memory area
  2017-01-01  0:00             ` H.J. Lu
@ 2017-01-01  0:00               ` Suprateeka R Hegde
  2017-01-01  0:00                 ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: Suprateeka R Hegde @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Carlos O'Donell, gnu-gabi

H.J,

I think we are somehow a full 180 degrees out of phase in our
discussion this time :-)

As I have already asked, I want to know what the ONE-FIXED-FORM of
__gnu_mbind_setup called by ld.so is.

The code you provided seems to be from Intel's implementation of
libmbind. I am interested in what it looks like in ld.so, because that
is what we want to document in the ABI support. We do not want
implementation-specific details in GNU-gABI.

So inside ld.so, would it be what I showed in my earlier mail, or would
it be something else?

In my opinion, we have to bring that out in the ABI support proposal.
Without the actual signature/prototype, __gnu_mbind_setup sounds more
like a guideline and less like an ABI spec/standard. And in actual code
(in ld.so), it may eventually look really different for each
vendor/implementation.

So, either keep it as a guideline or make it generic. IMHO, we cannot
treat the following (original text) as generic:

---
> Run-time support
>
> int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);
---

--
Supra


On 07-Mar-2017 04:05 AM, H.J. Lu wrote:
> On Mon, Mar 6, 2017 at 5:25 AM, Suprateeka R Hegde
> <hegdesmailbox@gmail.com> wrote:
>> On 04-Mar-2017 07:37 AM, Carlos O'Donell wrote:
>>>
>>> On 03/03/2017 11:00 AM, H.J. Lu wrote:
>>>>
>>>> __gnu_mbind_setup is called from ld.so.  Since there is only one ld.so,
>>>> it needs to know what to pass to __gnu_mbind_setup.  Not all arguments
>>>> have to be used by all implementations nor all memory types.
>>>
>>>
>>> I think what Supra is suggesting is a pointer-to-implementation interface
>>> which would allow ld.so to pass completely different arguments to the
>>> library depending on what kind of memory is being defined by the sh_info
>>> value. It avoids needing to encode all the types in the API, and just
>>> uses an incomplete pointer to the type.
>>
>>
>> Thats absolutely right.
>>
>> However, I am not suggesting one is better over the other. I just want to
>> get clarity on how the code looks like for different implementations.
>>
>> On 03-Mar-2017 09:30 PM, H.J. Lu wrote:
>>>
>>> __gnu_mbind_setup is called from ld.so.  Since there is only one ld.so,
>>> it needs to know what to pass to __gnu_mbind_setup.
>>
>>
>> So I want to know what is that ONE-FIXED-FORM of __gnu_mbind_setup being
>> called by ld.so.
>>
>>>  Not all arguments
>>> have to be used by all implementations nor all memory types.
>>
>>
>> I think I am still not getting this. Really sorry for that. Would it be
>> possible for you to write a small pseudo code that depicts how this design
>> looks like for different implementations?
>>
>
> For my usage, I only want to know memory type, address and its size:
>
> #define _GNU_SOURCE
> #include <unistd.h>
> #include <errno.h>
> #include <stdint.h>
> #include <cpuid.h>
> #include <numa.h>
> #include <numaif.h>
> #include <mbind.h>
>
> #ifdef LIBMBIND_DEBUG
> #include <stdio.h>
> #endif
>
> /* High-Bandwidth Memory node mask.  */
> static struct bitmask *hbw_node_mask;
>
> /* Initialize High-Bandwidth Memory node mask.  This must be called before
>    __gnu_mbind_setup.  */
> static void
> __attribute__ ((used, constructor))
> init_node_mask (void)
> {
>   if (__get_cpuid_max (0, 0) == 0)
>     return;
>
>   /* Check if vendor is Intel.  */
>   uint32_t eax, ebx, ecx, edx;
>   __cpuid (0, eax, ebx, ecx, edx);
>   if (!(ebx == 0x756e6547 && ecx == 0x6c65746e && edx == 0x49656e69))
>     return;
>
>   /* Get family and model.  */
>   uint32_t model;
>   uint32_t family;
>   __cpuid (1, eax, ebx, ecx, edx);
>   family = (eax >> 8) & 0x0f;
>   if (family != 0x6)
>     return;
>   model = (eax >> 4) & 0x0f;
>   model += (eax >> 12) & 0xf0;
>
>   /* Check for KNL and KNM.  */
>   switch (model)
>     {
>     default:
>       return;
>
>     case 0x57: /* Knights Landing.  */
>     case 0x85: /* Knights Mill.  */
>       break;
>     }
>
>   /* Check if NUMA configuration is supported.  */
>   int nodes_num = numa_num_configured_nodes ();
>   if (nodes_num < 2)
>     return;
>
>   /* Get MCDRAM NUMA nodes.  */
>   struct bitmask *node_mask = numa_allocate_nodemask ();
>   struct bitmask *node_cpu = numa_allocate_cpumask ();
>
>   int i;
>   for (i = 0; i < nodes_num; i++)
>     {
>       numa_node_to_cpus (i, node_cpu);
>       /* NUMA node without CPU is MCDRAM node.  */
>       if (numa_bitmask_weight (node_cpu) == 0)
>         numa_bitmask_setbit (node_mask, i);
>     }
>
>   if (numa_bitmask_weight (node_mask) != 0)
>     {
>       /* On Knights Landing and Knights Mill, MCDRAM is High-Bandwidth
>          Memory.  */
>       hbw_node_mask = node_mask;
>     }
>   else
>     numa_bitmask_free (node_mask);
>   numa_bitmask_free (node_cpu);
> }
>
> /* Support all different memory types.  */
>
> static int
> mbind_setup (unsigned int type, void *addr, size_t length,
>              unsigned int mode, unsigned int flags)
> {
>   int err = ENXIO;
>
>   switch (type)
>     {
>     default:
> #ifdef LIBMBIND_DEBUG
>       printf ("Unsupported mbind type %u: from %p of size %zu\n",
>               type, addr, length);
> #endif
>       return EINVAL;
>
>     case GNU_MBIND_HBW:
>       if (hbw_node_mask)
>         err = mbind (addr, length, mode, hbw_node_mask->maskp,
>                      hbw_node_mask->size, flags);
>       break;
>     }
>
>   if (err < 0)
>     err = errno;
>
> #ifdef LIBMBIND_DEBUG
>   printf ("Mbind type %u: from %p of size %zu\n", type, addr, length);
> #endif
>
>   return err;
> }
>
> int
> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
> {
>   return mbind_setup (type, addr, length, MPOL_BIND, MPOL_MF_MOVE);
> }
>
> If other memory types need additional information, they can be
> passed to __gnu_mbind_setup.  We just need to know what
> information is needed.
>
>
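The family/model test in init_node_mask above can be factored into a pure function, which makes it checkable without executing CPUID. A minimal sketch, assuming the leaf-1 EAX value has already been read (the quoted code does this with the __cpuid macro from GCC's <cpuid.h>); the helper name is_knl_or_knm is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode CPUID leaf 1 EAX the same way init_node_mask does: the extended
   model bits (19:16) are folded in only after confirming family 6.  */
static bool
is_knl_or_knm (uint32_t eax)
{
  uint32_t family = (eax >> 8) & 0x0f;
  if (family != 0x6)
    return false;

  uint32_t model = (eax >> 4) & 0x0f;
  model += (eax >> 12) & 0xf0;

  /* 0x57 is Knights Landing, 0x85 is Knights Mill.  */
  return model == 0x57 || model == 0x85;
}
```

For example, EAX = 0x00050671 (family 6, model 0x7, extended model 0x5) decodes to model 0x57 and is accepted, while a family 0xf signature is rejected before the model is examined.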


* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                       ` H.J. Lu
@ 2017-01-01  0:00                         ` Florian Weimer
  2017-01-01  0:00                           ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: Florian Weimer @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Suprateeka R Hegde, Carlos O'Donell, gnu-gabi

On 03/16/2017 07:22 PM, H.J. Lu wrote:

>> If the latter, why can't you use the existing ELF constructor mechanism for
>> this?  As far as I understand it, the call to __gnu_mbind_setup would just
>> happen before the constructor calls.
>
> That is correct.  The issue is to access the ELF segment header for each
> loaded object only once.  There is no good way to get this info from a
> constructor.

I think you can get the data in a pretty straightforward manner using 
dlinfo.

I expect that libraries such as bdwgc might want to use the 
__gnu_mbind_setup callback as well, just to register freshly loaded 
shared objects and their data sections.  Can we make this work for 
multiple users?

Thanks,
Florian


* RFC: ABI support for special memory area
@ 2017-01-01  0:00 H.J. Lu
  2017-01-01  0:00 ` Carlos O'Donell
  2017-01-01  0:00 ` Suprateeka R Hegde
  0 siblings, 2 replies; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: gnu-gabi

A system may have MCDRAM or other types of memory in addition to
normal RAM.  Here is an ABI proposal to allow placement in a section
whose sh_info field indicates the special memory type.

Any comments?

H.J.
---
To section attributes, add

#define SHF_GNU_MBIND     0x00100000

for sections used to place data or text into a special memory area.
The section names should start with ".mbind" so that they won't be
grouped together with normal sections by the link editor.  The sh_info
field indicates the special memory type.

To the "Program Header" section, add an inclusive range of segment types
for GNU_MBIND segments:

#define PT_GNU_MBIND_NUM    4096
#define PT_GNU_MBIND_LO     (PT_LOOS + 0x474e555)
#define PT_GNU_MBIND_HI     (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)

The array element specifies the location and size of a special memory area.
Each GNU_MBIND segment contains one GNU_MBIND section and the segment
type is PT_GNU_MBIND_LO plus the sh_info value.  If the sh_info value is
greater than or equal to PT_GNU_MBIND_NUM, no GNU_MBIND segment will be
created.  Each GNU_MBIND segment must be aligned on a page boundary.  The
interpretation of the special memory area information is
implementation-dependent.  Implementations may ignore GNU_MBIND segments.
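A small sketch of the mapping between a section's sh_info and the segment's p_type, assuming the inclusive range PT_GNU_MBIND_LO..PT_GNU_MBIND_HI defined above (so the largest usable sh_info is PT_GNU_MBIND_NUM - 1). The helper names are hypothetical, and the macros are defined locally in case <elf.h> does not provide them.

```c
#include <elf.h>
#include <stdint.h>

#ifndef PT_GNU_MBIND_NUM
# define PT_GNU_MBIND_NUM 4096
#endif
#ifndef PT_GNU_MBIND_LO
# define PT_GNU_MBIND_LO (PT_LOOS + 0x474e555)
#endif
#ifndef PT_GNU_MBIND_HI
# define PT_GNU_MBIND_HI (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)
#endif

/* p_type of the GNU_MBIND segment for a section with this sh_info, or
   PT_NULL when the value is out of range and no segment is created.  */
static uint32_t
mbind_segment_type (uint32_t sh_info)
{
  if (sh_info >= PT_GNU_MBIND_NUM)
    return PT_NULL;
  return PT_GNU_MBIND_LO + sh_info;
}

/* Memory type encoded in a program header's p_type, or -1 if the
   header is not a GNU_MBIND segment.  */
static int64_t
mbind_memory_type (uint32_t p_type)
{
  if (p_type < PT_GNU_MBIND_LO || p_type > PT_GNU_MBIND_HI)
    return -1;
  return (int64_t) (p_type - PT_GNU_MBIND_LO);
}
```

With PT_LOOS = 0x60000000 the range works out to 0x6474e555..0x6474f554, just above PT_GNU_RELRO (0x6474e552).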

Run-time support

int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);

It sets up a special memory area of 'type' and 'length' at 'addr', where
'addr' is a multiple of the page size.  It returns zero on success, a
positive errno value for a non-fatal error and a negative errno value for
a fatal error.

After all shared objects and the executable file are loaded and their
relocations are processed, the run-time loader calls __gnu_mbind_setup
with the type, address and length of each GNU_MBIND segment in a shared
object or the executable file.  The default implementation of
__gnu_mbind_setup is

int
__gnu_mbind_setup (unsigned int type, void *addr, size_t length)
{
  return 0;
}

which can be overridden by a different implementation at link-time.
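The return-value convention above, together with H.J.'s later statement in this thread that ld.so keeps going on a positive value and aborts on a negative one, can be sketched as a loader-side loop. Everything here is hypothetical: struct mbind_segment, run_mbind_setup and demo_mbind_setup are illustration names, and a fatal error is returned to the caller instead of calling abort() so the logic can be exercised.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical description of one GNU_MBIND segment.  */
struct mbind_segment
{
  unsigned int type;   /* p_type - PT_GNU_MBIND_LO */
  void *addr;          /* page-aligned start address */
  size_t length;
};

typedef int (*mbind_setup_fn) (unsigned int type, void *addr, size_t length);

/* Call SETUP for each segment.  A positive return is non-fatal: count it
   and continue.  A negative return is fatal: stop and return it (this is
   where ld.so would abort).  Returns 0 if no fatal error occurred.  */
static int
run_mbind_setup (const struct mbind_segment *segs, size_t n,
                 mbind_setup_fn setup, size_t *soft_failures)
{
  *soft_failures = 0;
  for (size_t i = 0; i < n; i++)
    {
      int r = setup (segs[i].type, segs[i].addr, segs[i].length);
      if (r < 0)
        return r;
      if (r > 0)
        ++*soft_failures;
    }
  return 0;
}

/* The default implementation from the proposal: accept everything.  */
static int
default_mbind_setup (unsigned int type, void *addr, size_t length)
{
  (void) type;
  (void) addr;
  (void) length;
  return 0;
}

/* Hypothetical override: type 0 succeeds, type 1 is unavailable
   (non-fatal ENXIO), anything else is treated as fatal (-EPERM).  */
static int
demo_mbind_setup (unsigned int type, void *addr, size_t length)
{
  (void) addr;
  (void) length;
  if (type == 0)
    return 0;
  if (type == 1)
    return ENXIO;
  return -EPERM;
}
```

A soft failure only means the data stays in ordinary memory, so the program still runs correctly, possibly more slowly.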


* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                                   ` H.J. Lu
@ 2017-01-01  0:00                                     ` Suprateeka R Hegde
  2017-01-01  0:00                                       ` H.J. Lu
  0 siblings, 1 reply; 33+ messages in thread
From: Suprateeka R Hegde @ 2017-01-01  0:00 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Carlos O'Donell, gnu-gabi

On 30-Mar-2017 10:10 PM, H.J. Lu wrote:
>> However, I am just thinking that your earlier approach --
>> __gnu_mbind_setup -- is better when shared libraries with GNU_MBIND
>> segments are dlopen'ed. They don't have to iterate all over again to
>> reach their PHDR. Or what is the recommendation for such dlopen'ed
>> libraries?
> 
> It is true that dl_iterate_phdr is called by every shared object, dlopened or
> not, to locate its own PHDR.

Let's put a one-liner on best practices, as a kind of guideline. You have
already made it clear in the example code; I am just thinking of putting
it in words too.

Let's say something like: each load module is expected to process only
its own special memory segments.  That means shared objects and
executables need not do any bookkeeping to avoid running the special
memory setup more than once for the same load module.

>> And since dl_iterate_phdr(3) is not part of any standard, it may change
>> in a totally incompatible way in the future.
>>
> 
> dl_iterate_phdr isn't in any standard.  But it is in glibc.  Given that my
> proposal is a GNU extension, it isn't a major issue.  Working with
> existing glibc is a big plus.

Awesome. Looks great. Thanks a lot for the new approach.

--
Supra


* Re: RFC: ABI support for special memory area
  2017-01-01  0:00     ` Carlos O'Donell
@ 2017-01-01  0:00       ` H.J. Lu
  2017-01-01  0:00         ` Florian Weimer
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: gnu-gabi

On Tue, Feb 28, 2017 at 8:19 AM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 02/23/2017 09:59 PM, H.J. Lu wrote:
>>> Why does it run _after_ all shared objects and the executable file are loaded?
>>
>> Since __gnu_mbind_setup may call any external functions, it can only
>> be done after everything is loaded and relocated.
>
> Who defines this function?

A platform vendor with special memory support should provide such a function.

> Where is it implemented?

We are working on libmbind to implement it.

> What does a typical implementation look like for MCDRAM use?

It uses NUMA, similar to memkind:

https://github.com/memkind/memkind

to bind pages to a NUMA node.
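The libmbind code quoted in this thread (init_node_mask) decides which NUMA nodes are MCDRAM with one heuristic: a node without CPUs is an MCDRAM node. A self-contained sketch of that heuristic without libnuma, assuming per-node CPU counts are passed in as a plain array (a hypothetical stand-in for numa_node_to_cpus plus numa_bitmask_weight) and at most 64 nodes.

```c
#include <stddef.h>
#include <stdint.h>

/* Return a bitmask with bit I set for every NUMA node I that has no CPUs,
   mirroring the "NUMA node without CPU is MCDRAM node" rule.  Like
   init_node_mask, report nothing on systems with fewer than 2 nodes.  */
static uint64_t
cpuless_node_mask (const unsigned int *cpus_per_node, size_t nodes_num)
{
  uint64_t mask = 0;

  if (nodes_num < 2)
    return 0;

  for (size_t i = 0; i < nodes_num && i < 64; i++)
    if (cpus_per_node[i] == 0)
      mask |= UINT64_C (1) << i;

  return mask;
}
```

With CPU counts {256, 0}, for instance, node 1 is reported; the resulting node set is what would then be handed to mbind(2) with MPOL_BIND.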

>>> Why not let the dynamic loader choose when it needs to setup the memory?
>>
>> 1. We want to be able to add support for new memory types by just
>> updating the run-time library that provides __gnu_mbind_setup, instead
>> of updating glibc.
>
> Which library defines it?

The default __gnu_mbind_setup is a weak function in ld.so since
ld.so can't have undefined function.  The real one is in libmbind
which overrides the default one in ld.so.
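A minimal sketch of a weak default definition using GCC's weak attribute, assuming the signature from the proposal. At static link time a strong definition of the same symbol replaces a weak one; at run time, which definition a reference binds to is decided by the dynamic linker's symbol lookup order.

```c
#include <stddef.h>

/* Weak default: does nothing and reports success, so objects with
   GNU_MBIND segments still run when no special memory support exists.
   A strong definition (e.g. in libmbind) takes precedence.  */
int __attribute__ ((weak))
__gnu_mbind_setup (unsigned int type, void *addr, size_t length)
{
  (void) type;
  (void) addr;
  (void) length;
  return 0;
}
```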

> Can two libraries define it? Does the dynamic loader run every DSO's
> version of __gnu_mbind_setup?

Only one will be used by ld.so.

>> 2. Since __gnu_mbind_setup may depend on other libraries, we
>> don't want a simple executable to require libfoo and libbar in addition
>> to glibc, nor to make libfoo and libbar part of glibc.
>
> Why can't this be run in a constructor? Is that too late?

We can use MCDRAM for dynamically allocated memory with
memkind.  We are looking for a user-friendly way to use MCDRAM
for normal data variables.

> This seems like a specialized form of constructor that is guaranteed
> to run before all other constructors?

Yes.

>>>> int
>>>> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
>>>> {
>>>>   return 0;
>>>> }
>>>>
>>>> which can be overridden by a different implementation at link-time.
>>>
>>> What if you _can't_ bind at ADDR?
>>
>> It happens on systems without special memory.  __gnu_mbind_setup
>> returns a positive value and ld.so keeps going.
>
> Isn't this a violation of what the application binary requested?

Even on systems with MCDRAM, the available MCDRAM may be exhausted.
The application should still run correctly.

> This is a soft failure that the application doesn't know about.

Performance may be lower.  But it will run correctly.

> Might this become a security issue if the application expected the
> specific memory type?

__gnu_mbind_setup returns a negative value for fatal error if
security is involved and ld.so aborts on fatal error.

>>> What if the binding would work if ADDR was any value?
>>>
>>
>> GNU_MBIND isn't a LOAD segment, similar to GNU_RELRO:
>>
>> Program Headers:
>>   Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
>>   LOAD           0x000000 0x00000000 0x00000000 0x54624 0x54624 R E 0x1000
>>   LOAD           0x054e9c 0x00055e9c 0x00055e9c 0x001b0 0x001b8 RW  0x1000
>>   DYNAMIC        0x054eac 0x00055eac 0x00055eac 0x00110 0x00110 RW  0x4
>>   NOTE           0x000114 0x00000114 0x00000114 0x00044 0x00044 R   0x4
>>   GNU_EH_FRAME   0x048eb8 0x00048eb8 0x00048eb8 0x00ff4 0x00ff4 R   0x4
>>   GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x10
>>   GNU_RELRO      0x054e9c 0x00055e9c 0x00055e9c 0x00164 0x00164 R   0x1
>>
>> ADDR contains the start of a memory region within the LOAD segment.
>
> What are the constraints of GNU_MBIND then?

Each GNU_MBIND segment must be aligned on a page boundary
and lie within one LOAD segment.

> Is it required that it covers only the SHF_GNU_MBIND marked sections which
> are part of a PT_LOAD segment?

The SHF_ALLOC bit must be set for SHF_GNU_MBIND sections.
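The two layout constraints just stated (page-aligned start, contained in one LOAD segment) can be checked mechanically. A sketch for 64-bit ELF only; mbind_segment_ok is a hypothetical name, and the SHF_ALLOC requirement is a section-level rule that this program-header check does not cover.

```c
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if MB (a GNU_MBIND program header) starts on a PAGE_SIZE boundary
   and lies entirely within a single PT_LOAD entry of PHDRS.  */
static bool
mbind_segment_ok (const Elf64_Phdr *phdrs, size_t phnum,
                  const Elf64_Phdr *mb, uint64_t page_size)
{
  if (mb->p_vaddr % page_size != 0)
    return false;

  for (size_t i = 0; i < phnum; i++)
    {
      const Elf64_Phdr *load = &phdrs[i];
      if (load->p_type != PT_LOAD || mb->p_vaddr < load->p_vaddr)
        continue;
      /* Offset of MB inside LOAD; comparisons written to avoid overflow.  */
      uint64_t off = mb->p_vaddr - load->p_vaddr;
      if (off <= load->p_memsz && mb->p_memsz <= load->p_memsz - off)
        return true;
    }
  return false;
}
```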


-- 
H.J.


* Re: RFC: ABI support for special memory area
  2017-01-01  0:00 ` Carlos O'Donell
@ 2017-01-01  0:00   ` H.J. Lu
  2017-01-01  0:00     ` Carlos O'Donell
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: gnu-gabi

On Thu, Feb 23, 2017 at 4:55 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 02/23/2017 11:19 AM, H.J. Lu wrote:
>> A system may have MCDRAM or other types of memory in addition to
>> normal RAM.  Here is an ABI proposal to allow placement in a section
>> whose sh_info field indicates the special memory type.
>>
>> Any comments?
>>
>> H.J.
>> ---
>> To section attributes, add
>>
>> #define SHF_GNU_MBIND     0x00100000
>>
>> for sections used to place data or text into a special memory area.
>> The section names should start with ".mbind" so that they won't be
>> grouped together with normal sections by the link editor.  The sh_info
>> field indicates the special memory type.
>>
>> To the "Program Header" section, add an inclusive range of segment types
>> for GNU_MBIND segments:
>>
>> #define PT_GNU_MBIND_NUM    4096
>> #define PT_GNU_MBIND_LO     (PT_LOOS + 0x474e555)
>> #define PT_GNU_MBIND_HI     (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)
>>
>> The array element specifies the location and size of a special memory area.
>> Each GNU_MBIND segment contains one GNU_MBIND section and the segment
>> type is PT_GNU_MBIND_LO plus the sh_info value.  If the sh_info value is
>> greater than or equal to PT_GNU_MBIND_NUM, no GNU_MBIND segment will be
>> created.  Each GNU_MBIND segment must be aligned on a page boundary.  The
>> interpretation of the special memory area information is
>> implementation-dependent.  Implementations may ignore GNU_MBIND segments.
>>
>> Run-time support
>>
>> int __gnu_mbind_setup (unsigned int type, void *addr, size_t length);
>>
>> It sets up a special memory area of 'type' and 'length' at 'addr', where
>> 'addr' is a multiple of the page size.  It returns zero on success, a
>> positive errno value for a non-fatal error and a negative errno value for
>> a fatal error.
>>
>> After all shared objects and the executable file are loaded and their
>> relocations are processed, the run-time loader calls __gnu_mbind_setup
>> with the type, address and length of each GNU_MBIND segment in a shared
>> object or the executable file.  The default implementation of
>> __gnu_mbind_setup is
>
> Why does it run _after_ all shared objects and the executable file are loaded?

Since __gnu_mbind_setup may call any external functions, it can only
be done after everything is loaded and relocated.

> Why not let the dynamic loader choose when it needs to setup the memory?

1. We want to be able to add support for new memory types by just
updating the run-time library that provides __gnu_mbind_setup, instead
of updating glibc.
2. Since __gnu_mbind_setup may depend on other libraries, we
don't want a simple executable to require libfoo and libbar in addition
to glibc, nor to make libfoo and libbar part of glibc.

>> int
>> __gnu_mbind_setup (unsigned int type, void *addr, size_t length)
>> {
>>   return 0;
>> }
>>
>> which can be overridden by a different implementation at link-time.
>
> What if you _can't_ bind at ADDR?

It happens on systems without special memory.  __gnu_mbind_setup
returns a positive value and ld.so keeps going.

> What if the binding would work if ADDR was any value?
>

GNU_MBIND isn't a LOAD segment, similar to GNU_RELRO:

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x000000 0x00000000 0x00000000 0x54624 0x54624 R E 0x1000
  LOAD           0x054e9c 0x00055e9c 0x00055e9c 0x001b0 0x001b8 RW  0x1000
  DYNAMIC        0x054eac 0x00055eac 0x00055eac 0x00110 0x00110 RW  0x4
  NOTE           0x000114 0x00000114 0x00000114 0x00044 0x00044 R   0x4
  GNU_EH_FRAME   0x048eb8 0x00048eb8 0x00048eb8 0x00ff4 0x00ff4 R   0x4
  GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x10
  GNU_RELRO      0x054e9c 0x00055e9c 0x00055e9c 0x00164 0x00164 R   0x1

ADDR contains the start of a memory region within the LOAD segment.

-- 
H.J.


* Re: RFC: ABI support for special memory area
  2017-01-01  0:00                                 ` Suprateeka R Hegde
@ 2017-01-01  0:00                                   ` H.J. Lu
  2017-01-01  0:00                                     ` Suprateeka R Hegde
  0 siblings, 1 reply; 33+ messages in thread
From: H.J. Lu @ 2017-01-01  0:00 UTC (permalink / raw)
  To: Suprateeka R Hegde; +Cc: Carlos O'Donell, gnu-gabi

On Thu, Mar 30, 2017 at 9:02 AM, Suprateeka R Hegde
<hegdesmailbox@gmail.com> wrote:
> On 03/27/17 21:45, H.J. Lu wrote:
>>
>> There is a way to support GNU_MBIND segments without the glibc changes.
>> Instead, dl_iterate_phdr
>>
>> int dl_iterate_phdr (int (*callback) (struct dl_phdr_info *info,
>>                                       size_t size, void *data),
>>                      void *data);
>>
>> is called via the .init_array section to process GNU_MBIND segments in
>> executable and shared objects:
>>
>> static int
>> callback (struct dl_phdr_info *info, size_t size, void *data)
>> {
>>   Compute the load address of the current module.
>>   if info->dlpi_addr == the load address of the current module
>>     {
>>       check ELF program headers and process GNU_MBIND segments
>>       return 1;
>>     }
>>
>>   return 0;
>> }
>>
>> static void
>> call_gnu_mbind_setup (void)
>> {
>>   dl_iterate_phdr (callback, NULL);
>> }
>>
>> static void (*init_array) (void)
>>  __attribute__ ((section (".init_array"), used))
>>  = &call_gnu_mbind_setup;
>
> This looks very ideal and perfect and matches my requirement too. Are
> you suggesting this dl_iterate_phdr(3) as the way in your proposal
> instead of the __gnu_mbind_setup?

Yes.

> Or are you suggesting that for all the implementations  that need
> different arguments (like that of my NVM) compared to
> __gnu_mbind_setup_v1, we go with this dl_iterate_phdr(3) way?

__gnu_mbind_setup_v1 is removed so that it will work with
existing C libraries with dl_iterate_phdr.

> I am OK either way.
>
> However, I am just thinking that your earlier approach --
> __gnu_mbind_setup -- is better when shared libraries with GNU_MBIND
> segments are dlopen'ed. They don't have to iterate all over again to
> reach their PHDR. Or what is the recommendation for such dlopen'ed
> libraries?

It is true that dl_iterate_phdr is called by every shared object, dlopened or
not, to locate its own PHDR.
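A compilable version of the dl_iterate_phdr pseudocode quoted above, as a sketch. The "compute the load address of the current module" step is filled in with one possible technique that is an assumption, not something stated in the thread: take the address of a static object in this module and test whether it lies inside one of the object's PT_LOAD segments, which also works for non-PIE executables whose dlpi_addr is 0. example_mbind_setup, module_anchor, module_found and mbind_segments_seen are hypothetical names; a real implementation would bind pages instead of counting.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#ifndef PT_GNU_MBIND_NUM
# define PT_GNU_MBIND_NUM 4096
#endif
#ifndef PT_GNU_MBIND_LO
# define PT_GNU_MBIND_LO (PT_LOOS + 0x474e555)
#endif
#ifndef PT_GNU_MBIND_HI
# define PT_GNU_MBIND_HI (PT_GNU_MBIND_LO + PT_GNU_MBIND_NUM - 1)
#endif

static char module_anchor;       /* its address identifies this module */
static int module_found;         /* illustration-only state */
static int mbind_segments_seen;

/* Hypothetical per-segment hook; a real one would call mbind(2).  */
static int
example_mbind_setup (unsigned int type, void *addr, size_t length)
{
  (void) type;
  (void) addr;
  (void) length;
  mbind_segments_seen++;
  return 0;
}

static int
callback (struct dl_phdr_info *info, size_t size, void *data)
{
  uintptr_t self = (uintptr_t) data;
  int mine = 0;
  (void) size;

  /* Is SELF inside one of this object's PT_LOAD segments?  */
  for (size_t i = 0; i < info->dlpi_phnum && !mine; i++)
    {
      const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
      uintptr_t start = info->dlpi_addr + ph->p_vaddr;
      if (ph->p_type == PT_LOAD && self >= start && self - start < ph->p_memsz)
        mine = 1;
    }
  if (!mine)
    return 0;                    /* another object: keep iterating */

  module_found = 1;
  for (size_t i = 0; i < info->dlpi_phnum; i++)
    {
      const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
      if (ph->p_type >= PT_GNU_MBIND_LO && ph->p_type <= PT_GNU_MBIND_HI)
        {
          int r = example_mbind_setup (ph->p_type - PT_GNU_MBIND_LO,
                                       (void *) (info->dlpi_addr + ph->p_vaddr),
                                       ph->p_memsz);
          if (r < 0)
            abort ();            /* fatal error per the proposal */
        }
    }
  return 1;                      /* found this module: stop iterating */
}

static void
call_gnu_mbind_setup (void)
{
  dl_iterate_phdr (callback, &module_anchor);
}

static void (*init_array_entry) (void)
  __attribute__ ((section (".init_array"), used)) = &call_gnu_mbind_setup;
```

Returning 1 from the callback stops dl_iterate_phdr once the module's own headers are found, so each load module processes only its own GNU_MBIND segments.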

> And since dl_iterate_phdr(3) is not part of any standard, it may change
> in a totally incompatible way in the future.
>

dl_iterate_phdr isn't in any standard.  But it is in glibc.  Given that my
proposal is a GNU extension, it isn't a major issue.  Working with
existing glibc is a big plus.

-- 
H.J.


end of thread, other threads:[~2017-03-31 15:11 UTC | newest]
