Re: RFE: enable buffering on null-terminated data

public inbox for libc-alpha@sourceware.org

From: Zachary Santer <zsanter@gmail.com>
To: Carl Edquist <edquist@cs.wisc.edu>
Cc: libc-alpha@sourceware.org, coreutils@gnu.org, p@draigbrady.com
Subject: Re: RFE: enable buffering on null-terminated data
Date: Sun, 10 Mar 2024 23:48:12 -0400
Message-ID: <CABkLJULka=Ox-WVNfqzeLYs1dX0h7ovnfjeRdqGSFcqVMJ47KQ@mail.gmail.com>
In-Reply-To: <317fe0e2-8cf9-d4ac-ed56-e6ebcc2baa55@cs.wisc.edu>

[-- Attachment #1: Type: text/plain, Size: 5283 bytes --]

On Sun, Mar 10, 2024 at 4:36 PM Carl Edquist <edquist@cs.wisc.edu> wrote:
>
> Hi Zack,
>
> This sounds like a potentially useful feature (it'd probably belong with a
> corresponding new buffer mode in setbuf(3)) ...
>
> > Filenames should be passed between utilities in a null-terminated
> > fashion, because the null byte is the only byte that can't appear within
> > one.
>
> Out of curiosity, do you have an example command line for your use case?

My use for 'stdbuf --output=L' is to be able to run a command within a
bash coprocess. (Really, a background process communicating with the
parent process through FIFOs, since Bash prints a warning message if
you try to run more than one coprocess at a time. Shouldn't make a
difference here.) See coproc-buffering, attached. Without making the
command's output either line-buffered or unbuffered, what I'm doing
there would deadlock. I feed one line in and then expect to be able to
read a transformed line immediately. If that transformed line is stuck
in a buffer that's still waiting to be filled, then nothing happens.
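
A minimal sketch of that one-line-in, one-line-out pattern, assuming GNU
sed and stdbuf are installed. The coprocess name 'trim', the sample input,
and the 5-second read timeout are made up for illustration and are not
taken from the attached script:

```bash
#!/usr/bin/env bash
set -o nounset

# Hypothetical coprocess "trim"; the filter mirrors the sed stage of the
# attached script, with its output forced to line-buffered mode.
coproc trim {
  stdbuf --output=L -- \
      sed --regexp-extended --expression='s/[[:blank:]]+$//'
}
# Save the descriptors and PID now: bash unsets them once the coprocess exits.
trim_in="${trim[1]}" trim_out="${trim[0]}" trim_pid="${trim_PID}"

# Feed one line in ...
printf '%s\n' $'some text \t ' >&"${trim_in}"

# ... and expect the transformed line back right away. If sed's output were
# fully buffered, this read would block; -t bounds the wait instead.
if IFS='' read -r -t 5 reply <&"${trim_out}"; then
  printf '|%s|\n' "${reply}"
else
  printf '%s\n' 'no reply within 5 seconds (output still buffered?)' >&2
fi

exec {trim_in}>&-    # EOF on the coprocess's stdin lets sed exit
wait "${trim_pid}"
```

Dropping the 'stdbuf --output=L' prefix should make the read time out
rather than return, which is the deadlock described above in miniature.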

I swear doing this actually makes sense in my application.

$ ./coproc-buffering 100000
Line-buffered:
real    0m17.795s
user    0m6.234s
sys     0m11.469s
Unbuffered:
real    0m21.656s
user    0m6.609s
sys     0m14.906s

When I initially implemented this thing, I felt lucky that the data I
was passing in were lines ending in newlines, and not null-terminated,
since my script gets to benefit from 'stdbuf --output=L'. Truth be
told, I don't currently have a need for --output=N. Of course, sed and
all sorts of other Linux command-line tools can produce or handle
null-terminated data.
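
For illustration, here is roughly what the null-terminated equivalent of
that sed stage could look like without any coprocess involved. This is only
a sketch: it assumes GNU sed (for --null-data) and bash 4.4 or later (for
'mapfile -d'), and the two sample records are invented:

```bash
#!/usr/bin/env bash
set -o nounset

# Two NUL-terminated records; the second one contains an embedded newline,
# which is why newline-delimited processing would not be safe here.
# mapfile -d '' splits the input on NUL bytes, and -t drops the delimiter.
mapfile -d '' -t trimmed < <(
  printf '%s\0' $'one \t' $'two\nlines  ' |
    sed --null-data --regexp-extended --expression='s/[[:blank:]]+$//'
)

# Print each record between bars so trailing blanks would be visible.
for record in "${trimmed[@]}"; do
  printf '|%s|\n' "${record}"
done
```

With --null-data, sed treats each NUL-terminated record as its pattern
space, so the embedded newline in the second record is left alone and only
the blanks at the end of the record are stripped.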

> > If I want to buffer output data on null bytes, the closest I can get is
> > 'stdbuf --output=0', which doesn't buffer at all. This is pretty
> > inefficient.
>
> I'm just thinking that find(1), for instance, will end up calling write(2)
> exactly once per filename (-print or -print0) if run under stdbuf
> unbuffered, which is the same as you'd get with a corresponding stdbuf
> line-buffered mode (newline or null-terminated).
>
> It seems that where line buffering improves performance over unbuffered is
> when there are several calls to (for example) printf(3) in constructing a
> single line.  find(1), and some filters like grep(1), will write a line at
> a time in unbuffered mode, and thus don't seem to benefit at all from line
> buffering.  On the other hand, cut(1) appears to putchar(3) a byte at a
> time, which in unbuffered mode will (like you say) be pretty inefficient.
>
> So, depending on your use case, a new null-terminated line buffered option
> may or may not actually improve efficiency over unbuffered mode.

I hadn't considered that.

> You can run your commands under strace like
>
>      stdbuf --output=X  strace -c -ewrite  command ... | ...
>
> to count the number of actual writes for each buffering mode.

I'm running bash in MSYS2 on a Windows machine, so hopefully that
doesn't invalidate any assumptions. After wrapping each command within
the coprocess in strace, and passing in only one line, I now have
coproc-buffering-strace, attached. Given the argument 'L', both sed
and expand call write() once. Given the argument 0, sed calls write()
twice and expand calls it a bunch of times, seemingly once for each
character it outputs. So I guess that's it.

$ ./coproc-buffering-strace L
|        Line with tabs   why?|

$ grep -c -F 'write:' sed-trace.txt expand-trace.txt
sed-trace.txt:1
expand-trace.txt:1

$ ./coproc-buffering-strace 0
|        Line with tabs   why?|

$ grep -c -F 'write:' sed-trace.txt expand-trace.txt
sed-trace.txt:2
expand-trace.txt:30

> Carl
>
>
> PS, "find -printf" recognizes a '\c' escape to flush the output, in case
> that helps.  So "find -printf '%p\0\c'" would, for instance, already
> behave the same as "stdbuf --output=N  find -print0" with the new stdbuf
> output mode you're suggesting.
>
> (Though again, this doesn't actually seem to be any more efficient than
> running "stdbuf --output=0  find -print0")
>
> On Sun, 10 Mar 2024, Zachary Santer wrote:
>
> > Was "stdbuf feature request - line buffering but for null-terminated data"
> >
> > See below.
> >
> > On Sun, Mar 10, 2024 at 5:38 AM Pádraig Brady <P@draigbrady.com> wrote:
> >>
> >> On 09/03/2024 16:30, Zachary Santer wrote:
> >>> 'stdbuf --output=L' will line-buffer the command's output stream.
> >>> Pretty useful, but that's looking for newlines. Filenames should be
> >>> passed between utilities in a null-terminated fashion, because the
> >>> null byte is the only byte that can't appear within one.
> >>>
> >>> If I want to buffer output data on null bytes, the closest I can get
> >>> is 'stdbuf --output=0', which doesn't buffer at all. This is pretty
> >>> inefficient.
> >>>
> >>> 0 means unbuffered, and Z is already taken for, I guess, zebibytes.
> >>> --output=N, then?
> >>>
> >>> Would this require a change to libc implementations, or is it possible now?
> >>
> >> This does seem like useful functionality,
> >> but it would require support for libc implementations first.
> >>
> >> cheers,
> >> Pádraig
> >
> >

[-- Attachment #2: coproc-buffering --]
[-- Type: application/octet-stream, Size: 1154 bytes --]

#!/usr/bin/env bash

set -o nounset -o noglob +o braceexpand
shopt -s lastpipe
export LC_ALL='C.UTF-8'

tab_spaces=8

sed_expr='s/[[:blank:]]+$//'

test=$'  \tLine with tabs\t why?\t  '

repeat="${1}"

coproc line_buffered {
  stdbuf --output=L -- \
      sed --binary --regexp-extended --expression="${sed_expr}" |
    stdbuf --output=L -- \
        expand --tabs="${tab_spaces}"
}

printf '%s' "Line-buffered:"
time {
  for (( i = 0; i < repeat; i++ )); do
    printf '%s\n' "${test}" >&"${line_buffered[1]}"
    IFS='' read -r line <&"${line_buffered[0]}"
    printf '|%s|\n' "${line}" > /dev/null
  done
}

exec {line_buffered[0]}<&- {line_buffered[1]}>&-
wait "${line_buffered_PID}"

coproc unbuffered {
  stdbuf --output=0 -- \
      sed --binary --regexp-extended --expression="${sed_expr}" |
    stdbuf --output=0 -- \
        expand --tabs="${tab_spaces}"
}

printf '%s' "Unbuffered:"
time {
  for (( i = 0; i < repeat; i++ )); do
    printf '%s\n' "${test}" >&"${unbuffered[1]}"
    IFS='' read -r line <&"${unbuffered[0]}"
    printf '|%s|\n' "${line}" > /dev/null
  done
}

exec {unbuffered[0]}<&- {unbuffered[1]}>&-
wait "${unbuffered_PID}"

[-- Attachment #3: coproc-buffering-strace --]
[-- Type: application/octet-stream, Size: 695 bytes --]

#!/usr/bin/env bash

set -o nounset -o noglob +o braceexpand
shopt -s lastpipe
export LC_ALL='C.UTF-8'

tab_spaces=8

sed_expr='s/[[:blank:]]+$//'

test=$'  \tLine with tabs\t why?\t  '

buffer_setting="${1}"

coproc buffer_test {
  stdbuf --output="${buffer_setting}" -- \
      strace -e -o sed-trace.txt \
      sed --binary --regexp-extended --expression="${sed_expr}" |
   stdbuf --output="${buffer_setting}" -- \
       strace -e -o expand-trace.txt \
       expand --tabs="${tab_spaces}"
}

printf '%s\n' "${test}" >&"${buffer_test[1]}"
IFS='' read -r line <&"${buffer_test[0]}"
printf '|%s|\n' "${line//$'\t'/TAB}"

exec {buffer_test[0]}<&- {buffer_test[1]}>&-
wait "${buffer_test_PID}"
