public inbox for libc-help@sourceware.org
From: Dirk Gouders <dirk@gouders.net>
To: libc-help@sourceware.org
Subject: Re: Help: match '\0' with regexec(3)
Date: Sat, 03 Feb 2024 21:50:59 +0100
Message-ID: <gho7cxuv9o.fsf@gouders.net>
In-Reply-To: <ghsf29uw2h.fsf@gouders.net> (Dirk Gouders's message of "Sat, 03 Feb 2024 21:33:42 +0100")

[-- Attachment #1: Type: text/plain, Size: 687 bytes --]

Hi again,

I'm very sorry: right after sending the mail I found an error in the program
(corrected version attached).

This perhaps settles my uncertainty about '.', which does not seem to match '\0':

$ printf ".\na\0b\n" | ./test_regex 
Compiling regex "."
Testing string "610062"...
regexec match: pos 0 length 1
        "a"
Testing string "0062"...
regexec match: pos 2 length 1
        "b"

But this expression matches '\0':

$ printf "[^\\\x01-\\\xff]\na\0b\n" | ./test_regex 
Compiling regex "[^\x01-\xff]"
Testing string "610062"...
regexec match: pos 0 length 1
        "a"
Testing string "0062"...
regexec match: pos 1 length 1
        ""
Testing string "62"...
regexec match: pos 2 length 1
        "b"

Regards,

Dirk


[-- Attachment #2: regexec(3) test-program --]
[-- Type: text/plain, Size: 1749 bytes --]

#include <stdlib.h>
#include <stdio.h>
#include <regex.h>

int main()
{
        char *line = NULL;
        int ret;

        /*
         * \b (backspace) is used to produce bold or underlined text.
         */
        char *ref_regex = "[A-Za-z\b._-]+\\(..?\\)";
        static regex_t *preg;
        regmatch_t pmatch[1];
        char *line_ptr;
        size_t line_length;

        if (preg == NULL) {
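                preg = calloc(1, sizeof(regex_t));
                if (preg == NULL) {
                        perror("calloc");
                        exit(1);
                }
                /* Parenthesize the assignment: != binds tighter than =. */
                if ((ret = regcomp(preg, ref_regex, REG_EXTENDED)) != 0) {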
                        fprintf(stderr, "regcomp() failed: %d\n", ret);
                        exit(1);
                }
        }
                
        while (1) {
                /* position to next non empty line */
                while (1) {
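                        /*
                         * getline() returns the number of characters read,
                         * or -1 on EOF/error; line_length only holds the
                         * buffer capacity.
                         */
                        ret = getline(&line, &line_length, stdin);

                        if (ret == -1) {
                                regfree(preg);
                                free(preg);
                                free(line);
                                return 0;
                        }
                        /* more than just the newline: non-empty line */
                        if (ret > 1)
                                break;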
                }
                
                printf("%s: while finished: %zu\n", __func__, line_length);
                line_ptr = line;

                while (1) {
                        ret = regexec(preg, line_ptr, 1, pmatch, 0);

                        if (ret != 0)
                                break;

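                        /* print only the matched text, not the rest of the line */
                        printf("regexec match \"%.*s\"\n",
                               (int)(pmatch[0].rm_eo - pmatch[0].rm_so),
                               line_ptr + pmatch[0].rm_so);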

                        line_ptr += pmatch[0].rm_eo;
                }
        }
}

[-- Attachment #3: Type: text/plain, Size: 2626 bytes --]



Dirk Gouders <dirk@gouders.net> writes:

> Hi,
>
> I would like to ask for an explanation of, or a hint about, my error in
> trying to use regexec(3) to match null characters ('\0').
>
> To illustrate it, I wrote the attached test program. What I do not
> understand is why I get wrong match positions when testing with a string
> that contains '\0' (I am not absolutely sure whether '.' is supposed to
> match '\0').
>
> Here is some "normal" output:
>
> $ printf ".\nab\n" | ./test_regex
> Compiling regex "."
> Testing string "ab"...
> regexec match: pos 0 length 1
>         "ab"
> Testing string "b"...
> regexec match: pos 1 length 1
>         "b"
> Testing string ""...
>
> But when I insert a '\0' into that string, the result is confusing to
> me:
>
> $ printf ".\na\0b\n" | ./test_regex
> Compiling regex "."
> Testing string "a"...
> regexec match: pos 0 length 1
>         "a"
> Testing string ""...
> regexec match: pos 2 length 1
>         "b"
> Testing string "b"...
> regexec match: pos 2 length 1
>         "b"
> Testing string ""...
>
> My apologies in advance if I could easily have answered this question
> myself by searching properly.
>
> Regards,
>
> Dirk
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <regex.h>
>
> int main()
> {
>         int ret;
>
>         char *line = NULL;
>         char *reg_expr = NULL;
>         size_t line_len = 256;
> 	size_t l;
>
>         static regex_t preg;
>
>         regmatch_t pmatch[1];
>
>                 
> 	ret = getline(&reg_expr, &line_len, stdin);
>
> 	if (ret < 1)
> 		exit(1);
>
> 	reg_expr[ret - 1] = '\0'; /* remove newline */
>
> 	printf("Compiling regex \"%s\"\n", reg_expr);
>
> 	if ((ret = regcomp(&preg, reg_expr, REG_EXTENDED | REG_NEWLINE)) != 0) {
> 		fprintf(stderr, "regcomp() failed: %d\n", ret);
> 		exit(1);
> 	}
>
>
> 	while (1) {
> 		ret = getline(&line, &line_len, stdin);
>
> 		if (ret < 1)
> 			break;
>
> 		line[ret - 1] = '\0'; /* remove newline */
> 		line_len = ret - 1;
>
> 		for (int i = 0; i < line_len; i += l ? l : 1) {
>
> 			pmatch[0].rm_so = 0;
> 			pmatch[0].rm_eo = line_len - i;
>
> 			printf("Testing string \"");
> 			for (int j = i; j < line_len; j++)
> 				printf("%c", line[j]);
> 			printf("\"...\n");
>
> 			ret = regexec(&preg, line + i, 1, pmatch, REG_NOTEOL | REG_STARTEND);
>
> 			if (ret != 0) {
> 				printf("No match.\n");
> 				break;
> 			} else
> 				printf("regexec match: pos %u length %u\n\t\"%s\"\n",
> 				       pmatch[0].rm_so + i,
> 				       pmatch[0].rm_eo - pmatch[0].rm_so,
> 				       line + i + pmatch[0].rm_so);
>
> 			l = pmatch[0].rm_eo - pmatch[0].rm_so;
> 		}
> 	}
> }

Thread overview: 2+ messages
2024-02-03 20:33 Dirk Gouders
2024-02-03 20:50 ` Dirk Gouders [this message]