Re: RFE: enable buffering on null-terminated data

public inbox for libc-alpha@sourceware.org

From: Carl Edquist <edquist@cs.wisc.edu>
To: Zachary Santer <zsanter@gmail.com>
Cc: libc-alpha@sourceware.org, coreutils@gnu.org, p@draigbrady.com
Subject: Re: RFE: enable buffering on null-terminated data
Date: Thu, 14 Mar 2024 09:15:58 -0500 (CDT)
Message-ID: <dab29084-d0e4-443c-4af7-62b7d2bb4ac2@cs.wisc.edu>
In-Reply-To: <CABkLJULSTemarEOFXj+8gOb4t-+dLYhfdGD1OF0E+zVRo=WQ3A@mail.gmail.com>

On Mon, 11 Mar 2024, Zachary Santer wrote:

> On Mon, Mar 11, 2024 at 7:54 AM Carl Edquist <edquist@cs.wisc.edu> 
> wrote:
>>
>> (In my coprocess management library, I effectively run every coproc 
>> with --output=L by default, by eval'ing the output of 'env -i stdbuf 
>> -oL env', because most of the time for a coprocess, that's what's 
>> wanted/necessary.)
>
> Surrounded by 'set -a' and 'set +a', I guess? Now that's interesting.

Ah, no - I use the 'VAR=VAL command line' syntax so that it's specific to 
the command (it's not left exported to the shell).

Effectively the coprocess commands are run with

 	LD_PRELOAD=... _STDBUF_O=L command line

This allows running shell functions as the command line, and they will all 
get the desired stdbuf behavior.  That matters because you can't pass a 
shell function (within the context of the current shell) as the command to 
stdbuf.

As far as I can tell, the stdbuf tool sets LD_PRELOAD (to point to 
libstdbuf.so) and your custom buffering options in _STDBUF_{I,O,E}, in the 
environment for the program it runs.  The double-env thing there is just a 
way to cleanly get exactly the env vars that stdbuf sets.  The values 
don't change, but since they are an implementation detail of stdbuf, it's 
a bit more portable to grab the values this way rather than hard code 
them.  This is done only once per shell session to extract the values and 
save them to a private variable, and then they are used for the command 
line as shown above.
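A minimal sketch of that capture-once idea, for illustration only.  The 
names stdbuf_env, line_buffered and show_mode are made up here, and this 
version exports inside a subshell instead of using the 'VAR=VAL command 
line' prefix form, so any shell state the function changes is lost when 
the subshell exits.  It also assumes the libstdbuf.so path contains no 
whitespace, since the export relies on word splitting.

```bash
# Grab the variables stdbuf would set, once per session.
stdbuf_env=$(env -i stdbuf -oL env)

# Hypothetical helper: run a command or shell function with those
# variables exported only inside a subshell (word splitting intended).
line_buffered () {
	( export $stdbuf_env; "$@" )
}

# A shell function standing in for "command line".
show_mode () { echo "stdout mode: ${_STDBUF_O:-default}"; }

line_buffered show_mode    # stdout mode: L
show_mode                  # stdout mode: default
```

The second call shows that nothing is left exported in the calling shell.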

Of course, if "command line" starts with "stdbuf --output=0" or whatever, 
that will override the new line-buffered default.
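One way to see the override, as a sketch: show_override is a hypothetical 
name, and the subshell is only there to keep the export from leaking into 
the calling shell.

```bash
show_override () {
	(
		# Line-buffered default for everything started from here.
		export $(env -i stdbuf -oL env)
		env | grep '^_STDBUF_O='              # _STDBUF_O=L

		# An explicit stdbuf on the command line replaces the default.
		stdbuf -o0 env | grep '^_STDBUF_O='   # _STDBUF_O=0
	)
}

show_override
```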

You can definitely export it to your shell though, either with 'set -a' 
like you said, or with the export command.  After that everything you run 
should get line-buffered stdio by default.

> I just added that to a script I have that prints lines output by another 
> command that it runs, generally a build script, to the command line, but 
> updating the same line over and over again. I want to see if it updates 
> more continuously like that.

So, a lot of times build scripts run a bunch of individual commands. 
Each of those commands has an implied flush when it terminates, so you 
will get the output from each of them promptly (as each command 
completes), even without using stdbuf.

Where things get sloppy is if you add some stuff in a pipeline after your 
build script, which results in things getting block-buffered along the 
way:

 	$ ./build.sh | sed s/what/ever/ | tee build.log

And there you will definitely see a difference.

 	sloppy () {
 		for x in {1..10}; do sleep .2; echo $x; done |
 		sed s/^/:::/ | cat
 	}

 	{
 		echo before:
 		sloppy
 		echo

 		export $(env -i stdbuf -oL env)

 		echo after:
 		sloppy
 	}
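If you would rather not export anything, a narrower variant (a sketch, 
with the made-up name prompt_lines) is to put stdbuf only in front of the 
pipeline stage that buffers.  Here that is sed, since echo is a shell 
builtin and cat does not go through stdio buffering.

```bash
prompt_lines () {
	for x in 1 2 3; do sleep .2; echo "$x"; done |
	stdbuf -oL sed 's/^/:::/' | cat
}

# Lines should now show up one at a time rather than all at the end.
prompt_lines
```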

> Yeah, there's really no way to break what I'm doing into a standard 
> pipeline.

I admit I'm curious what you're up to  :)

> Of course, using line-buffered or unbuffered output in this situation 
> makes no sense. Where it might be useful in a pipeline is when an 
> earlier command in a pipeline might only print things occasionally, and 
> you want those things transformed and printed to the command line 
> immediately.

Right ... And in that case, losing the performance benefit of a larger 
block buffer is a smaller price to pay.

> My assumption is that line-buffering through setbuf(3) was implemented 
> for printing to the command line, so its availability to stdbuf(1) is 
> just a useful side effect.

Right, stdbuf(1) leverages setbuf(3).

setbuf(3) tweaks the buffering behavior of stdio streams (stdin, stdout, 
stderr, and anything else you open with, eg, fopen(3)).  It's not really 
limited to terminal applications, but yeah it makes it easier to ensure 
that your calls to printf(3) actually get output after each line (whether 
that's to a file or a pipe or a tty), without having to call an explicit 
fflush(3) of stdout every time.

stdbuf(1) sets LD_PRELOAD to libstdbuf.so for your program, causing it to 
call setbuf(3) at program startup based on the values of _STDBUF_* in the 
environment (which stdbuf(1) also sets).
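You can look at what stdbuf hands to the program with the same double-env 
trick.  A small sketch follows; stdbuf_vars is a hypothetical helper name, 
and the exact LD_PRELOAD path will differ between systems.

```bash
# Print the environment stdbuf sets up for the given buffering options,
# starting from an empty environment so nothing else shows up.
stdbuf_vars () {
	env -i stdbuf "$@" env | sort
}

# Expect one _STDBUF_* variable per stream option given, plus an
# LD_PRELOAD entry that points at libstdbuf.so.
stdbuf_vars -i0 -oL -e0
```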

(That's my read of it anyway.)

> In the BUGS section in the man page for stdbuf(1), we see: On GLIBC 
> platforms, specifying a buffer size, i.e., using fully buffered mode 
> will result in undefined operation.

Eheh xD

Oh, I imagine "undefined operation" means something more like 
"unspecified" here.  stdbuf(1) uses setbuf(3), so the behavior you'll get 
should be whatever the setbuf(3) from the libc on your system does.

I think all this means is that the C/POSIX standards are a bit loose about 
what is required of setbuf(3) when a buffer size is specified, and there 
is room in the standard for it to be interpreted as only a hint.

> If I'm not mistaken, then buffer modes other than 0 and L don't actually 
> work. Maybe I should count my blessings here. I don't know what's going 
> on in the background that would explain glibc not supporting any of 
> that, or stdbuf(1) implementing features that aren't supported on the 
> vast majority of systems where it will be installed.

Hey, try it, right?

Works for me (on glibc-2.23)

 	$ for s in 8k 16k 32k 1M; do
 	    echo ::: $s :::
 	    { stdbuf -o$s strace -ewrite tr 1 2
 	    } < /dev/zero 2>&1 > /dev/null | head -3
 	    echo
 	  done

 	::: 8k :::
 	write(1, "\0\0\0\0\0\0\0\0"..., 8192) = 8192
 	write(1, "\0\0\0\0\0\0\0\0"..., 8192) = 8192
 	write(1, "\0\0\0\0\0\0\0\0"..., 8192) = 8192

 	::: 16k :::
 	write(1, "\0\0\0\0\0\0\0\0"..., 16384) = 16384
 	write(1, "\0\0\0\0\0\0\0\0"..., 16384) = 16384
 	write(1, "\0\0\0\0\0\0\0\0"..., 16384) = 16384

 	::: 32k :::
 	write(1, "\0\0\0\0\0\0\0\0"..., 32768) = 32768
 	write(1, "\0\0\0\0\0\0\0\0"..., 32768) = 32768
 	write(1, "\0\0\0\0\0\0\0\0"..., 32768) = 32768

 	::: 1M :::
 	write(1, "\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
 	write(1, "\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
 	write(1, "\0\0\0\0\0\0\0\0"..., 1048576) = 1048576

>> It may just be that nobody has actually had a real need for it. 
>> (Yet?)
>
> I imagine if anybody has, they just set --output=0 and moved on. Bash 
> scripts aren't the fastest thing in the world, anyway.

Ouch.  Ouch.  Ouuuuch.  :)

While that's true if you're talking about bash itself doing the actual 
computation and data processing, the main work of the shell is making it 
easy to set up pipelines for other (very fast) programs to pass their data 
around.

The stdbuf tool is not meant for the shell!  It's meant for those very 
fast programs that the shell stands up.

Using stdbuf to tweak a very fast program, causing it to output more often 
at newlines over pipes rather than at block boundaries, does slow down 
those programs somewhat.  But as we've discussed, this is necessary for 
certain pipelines that have two-way communication (including coprocesses), 
or in general any time you want the output immediately.

What may not be obvious is that the shell does not need to get involved 
with writing input for a coprocess or reading its output - the shell can 
start other (very fast) programs with input/output redirected to/from the 
coprocess pipes to do that processing.
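A rough sketch of what that can look like in bash, under a few 
assumptions: upcase_line and the coproc name UPPER are made up for 
illustration, tr stands in for the "very fast program", and cat and head 
stand in for whatever external tools would feed and drain it.  The shell 
itself only sets up the redirections.

```bash
upcase_line () {
	local pid in_fd out_fd

	# Line-buffer the coprocess output so a reply arrives per line.
	coproc UPPER { stdbuf -oL tr a-z A-Z; }
	pid=$UPPER_PID in_fd=${UPPER[1]} out_fd=${UPPER[0]}

	# External programs do the writing and reading; bash only
	# redirects them to and from the coprocess pipes.
	cat <<<"$1" >&"$in_fd"
	head -n 1 <&"$out_fd"

	# Close the write end so tr sees EOF, then reap it.
	exec {in_fd}>&-
	wait "$pid"
}

upcase_line hello    # HELLO
```

Without the stdbuf -oL, tr would be expected to hold its output in a 
block buffer and head would wait indefinitely, which is the two-way 
communication case mentioned above.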

My point though earlier was that a null-terminated record buffering mode, 
as useful as it sounds on the surface (for null-terminated paths), may 
actually be something _nobody_ has ever actually needed for an actual (not 
contrived) workflow.

But then again I say "Yet?" - because, never say never.

Happy line-buffering  :)

Carl
