public inbox for cygwin@cygwin.com
* RE: gawk 4.1.4: CR separate char for CRLF files
From: Vermessung AVT - Wolfgang Rieger @ 2017-08-14 10:36 UTC
  To: cygwin

On Wed, 9 Aug 2017 10:38 +0000, Jannick wrote:

--- snip ---
> Now I can see the following *easy* solutions to the situation here (input only for now):
>
> 1 - Inserting the BEGIN section as you suggested into more than 1k scripts (not feasible due to the additional regression test workload)
>
> 2 - Calling 'gawk -vRS=\r\n -vORS=\r\n' instead of 'gawk' (a hack to roll back the latest gawk's added complexity; wrapper needed)
>
> 3 - Wrapping a d2u/u2d pipe solution (additional app and wrapper needed again)
>
> 4 - Using another compiled version of gawk which does *not* disable the out-of-the-box gawk feature to swallow CRs (cf., e.g., http://git.savannah.gnu.org/cgit/gawk.git/tree/awkgram.y#n3543), i.e.
> without the artificial obstacle of now having to know the EOL type of the input file before running gawk.
>
>> It works in all my cases. The only disadvantage: you have to know what kind
>
>... plus the disadvantage to systematically amend all the scripts instead of having an external solution 
>
>> of files you want to handle in the awk script. The same awk script 
>> will not
>> work for DOS files as well as for linux files.
>
>... another issue originated by the change and which didn't exist before.
>
>> Best
>> 
>> Roger
>
> Please don't get me wrong, but this raises a real issue here and I am not sure which rationale other than 'let's get more of the Linux-feel' drove the decision.
>
> All the best,
> J. 
--- snip ---

Another solution which we have been using for many years now, though it might not be feasible for you:

We very rarely update Cygwin. We have been using Cygwin for some 15+ years now. We use tools like gawk (hundreds of scripts), head, tail, sort, etc., which we call from scripts running under cmd.exe (no Unix shells involved). I soon realized that upgrades of Cygwin may cause trouble with existing scripts, so we only update if we really need to (e.g. important new functionality, the 32 to 64 bit shift, possibly new Windows versions, bugs we needed fixed).

I have followed the past discussions about the CR/LF behaviour changes attentively and decided not to update in the near future, because that would lead to a massive problem with many hundreds of scripts. I am hoping that at some point there will be a change in gawk again.

What is Unix-like or OS-like or POSIX-like behaviour in that context? You could argue that gawk should interpret line endings like the underlying OS does (i.e., gawk reads LF on Unix and CR/LF on Windows), or that it should interpret line endings Unix-style regardless of the underlying OS. That's a developer's decision in my opinion.

But since gawk wrote no CRs to pipes or redirected output even in previous versions, we already had the situation that gawk had to accept *both* inputs, LF with or without CR. That has worked largely fine so far, since fortunately most Windows and other application software we use accepts both record formats (we had issues with software upgrades from other vendors no longer accepting pure LF, but that only concerned a very small number of scripts). With the new approach in Cygwin that seems to be broken, so we have not upgraded Cygwin since then (we currently use gawk 4.1.3).

Of course the reason for that really annoying CR/LF thing is the arrogance and ignorance of MS, which has cost innumerable wasted developer hours when I think of the endless discussions and changes in Cygwin; but MS is the one who defines the standards because of its sheer market power, so we have to deal with it, whether we like it or not. I'd definitely prefer to use Unix for its powerful tools, but most of the software we use is simply not available for Unix, and MS does not provide gawk etc. So we have to deal with that CR/LF issue pragmatically rather than, say, philosophically: we need to run our scripts with as few changes as possible. That's why we upgrade Cygwin as seldom as possible. It is a "living system", yes, which is great on the one hand, but can be annoying in everyday practice.

In my opinion there should be at least an option for gawk to accept both LF and CR/LF line endings equally, preferably with a system variable so that there is no need to change the command line call of gawk at all. That's what I vote for.
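
Until such an option exists, the effect can be approximated inside a script. The following is only a minimal sketch under assumptions, not something gawk or Cygwin provides: `strip_cr_fields` is a hypothetical shell function wrapping a one-line awk program that removes one trailing CR from each record, so that CRLF and LF input yield the same fields. It uses only plain awk features and is written for a POSIX shell rather than cmd.exe.

```shell
# Hypothetical helper: treat CRLF and LF records alike.
# The first action strips one trailing CR from the record; awk then
# re-splits the fields, so the last field no longer carries a stray CR.
strip_cr_fields() {
    awk '{ sub(/\r$/, ""); print "[" $NF "]" }'
}
```

For example, `printf 'x y\r\nx y\n' | strip_cr_fields` prints `[y]` for both records, whereas without the `sub()` call the first record's last field would still end in a CR.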

Kind regards,
Wolfgang

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

* Re: gawk 4.1.4: CR separate char for CRLF files
From: cyg Simple @ 2017-08-16 12:50 UTC
  To: cygwin

Vermessung AVT - Wolfgang Rieger writes:

> 5) You can always find a better way to do things, of course, I won't
> argue about that. Sometimes we thought about switching to Java or php
> or python or whatever. Maybe, we should. But we have a lot of running
> scripts, massive batch and parallel processing, and cmd.exe with
> minimum Cygwin (no X subsystem, no pile of tools, just a tiny
> installation) has worked great for many years - so why not use it?
> Just because it is not intended to be used that way?

Just because it is not intended to be used that way, yes, that is the
reason not to do it.  Just because it works now doesn't mean that it
will continue to work, and you put yourself in jeopardy if you ever
update your software.  Your use of cmd.exe instead of a Cygwin
shell also puts you at risk of not being able to execute your scripts.
While Cygwin doesn't intentionally prevent its binaries from executing
outside of Cygwin, those binaries are only supported if the
problems exist within the Cygwin shell as well.  So if an executable
provides expected results in bash but not in cmd, you lose.

-- 
cyg Simple

P.S.: You need to learn how to use a proper mail client and respond to
this list appropriately.  I had to "edit as new" and hand edit the mail
just to get proper quoting.

* Re: gawk 4.1.4: CR separate char for CRLF files
From: Vermessung AVT - Wolfgang Rieger @ 2017-08-16 12:19 UTC
  To: cygwin

Achim Gratz wrote:
Vermessung AVT - Wolfgang Rieger writes:
> Another solution which we have been using for many years now, though 
> it might not be feasible for you:

----------------------------- snip --------------------------------

Jannick, another idea I had thought of previously might possibly help:

gawk can include source code with the @include "myfile.awk" syntax. I have sometimes thought of providing a general awk script that deals with oddities of any kind and that could easily be changed in just myfile.awk when necessary, e.g. after updates. You could even think of an optional environment variable to control which script to include. It should be easy to add such an @include line to all gawk scripts automatically. Did you think of something like that?
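
A rough sketch of how that could look, under stated assumptions: the file names `crlf_compat.awk` and `report.awk` and the function `run_report` are hypothetical, and the example is written for a POSIX shell with gawk installed rather than for cmd.exe. It relies on gawk's `AWKPATH` environment variable, which gawk consults for both `-f` files and `@include` files, so pointing `AWKPATH` at a different directory selects a different compatibility file without touching the scripts.

```shell
# Hypothetical layout: one shared include file plus one script that uses it.
compat_dir=$(mktemp -d)

# crlf_compat.awk: the single place to adjust if gawk's behaviour changes.
cat > "$compat_dir/crlf_compat.awk" <<'EOF'
# Strip one trailing CR so CRLF and LF records look the same to later rules.
{ sub(/\r$/, "") }
EOF

# report.awk: an ordinary script with the @include line added at the top,
# so the included rule runs before the script's own rules.
cat > "$compat_dir/report.awk" <<'EOF'
@include "crlf_compat.awk"
{ print $2 "|" }
EOF

# AWKPATH tells gawk where to search for report.awk and for the include.
run_report() {
    AWKPATH="$compat_dir" gawk -f report.awk
}
```

Feeding mixed input, e.g. `printf 'a b\r\nc d\n' | run_report`, should then give the same kind of output line for the CRLF record as for the LF record.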

Kind regards,
Wolfgang


* Re: gawk 4.1.4: CR separate char for CRLF files
From: Vermessung AVT - Wolfgang Rieger @ 2017-08-16 12:09 UTC
  To: cygwin

Achim Gratz wrote:
Vermessung AVT - Wolfgang Rieger writes:
> Another solution which we have been using for many years now, though 
> it might not be feasible for you:

Cygwin is, like it or not, a rolling distribution.

> We very rarely update Cygwin. We have been using Cygwin for some 15+ 
> years now. We use tools like gawk (hundreds of scripts), head, tail, 
> sort, etc., which we call from scripts running under cmd.exe 
> (no Unix shells involved). I soon realized that upgrades of Cygwin may 
> cause trouble with existing scripts, so we only update if we really 
> need to (e.g. important new functionality, the 32 to 64 bit 
> shift, possibly new Windows versions, bugs we needed fixed).

Hopefully the machine(s) running those scripts are isolated.

In your particular case you might be better off using MSys2 or GNUwin32 tools, although you'd still need a better way to deal with updates.
Also, audit your scripts for non-portable constructs, since those are the parts that are most likely to break.  CMD scripting is a tough nut to crack if it's of any complexity and there are lots of things that are poorly or not officially documented.  I don't quite understand why you use POSIX tools, but specifically shun POSIX scripting.

> I have followed the past discussions about the CR/LF behaviour 
> changes attentively and decided not to update in the near future, 
> because that would lead to a massive problem with many hundreds of 
> scripts - hoping that at some point there will be a change in gawk again.

You'd better replace that hope with a feature request at gawk upstream.

> What is Unix-like or OS-like or Posix-like behaviour in that context?
> You could argue that gawk interprets line endings like the underlying 
> OS does (i.e., gawk reads LF in Unix and CR/LF in Win), or it 
> interprets line endings Unix-style regardless of the underlying OS 
> used. That's a developer's decision in my opinion.

Cygwin uses LF line endings (yes there are still text mounts, but you'd be better off pretending they don't exist).  When you're trying to use it for CRLF files, you need to wrap those invocations to do an explicit conversion.

https://cygwin.com/cygwin-ug-net/using-textbinary.html
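
The explicit conversion mentioned above could be wrapped roughly as follows. This is a sketch under assumptions, not an existing Cygwin tool: `awk_crlf` is a hypothetical name, `tr -d '\r'` stands in for a d2u-style converter and drops every CR (not only those directly before an LF), and the output is always written with CRLF line endings.

```shell
# Hypothetical wrapper: LF-only inside the pipeline, CRLF outside.
# Stage 1 removes all CRs from the input.
# Stage 2 runs the unmodified awk program given as arguments on LF input.
# Stage 3 re-terminates every output line with CRLF.
awk_crlf() {
    tr -d '\r' |
    awk "$@" |
    awk '{ printf "%s\r\n", $0 }'
}
```

A call such as `awk_crlf '{ print $2 }' < input.txt > output.txt` would leave the awk program itself untouched; only the invocation changes.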

> But since gawk wrote no CRs to pipes or redirected output even in 
> previous versions, we already had the situation that gawk had to 
> accept *both* inputs, LF with or without CR. That has worked largely 
> fine so far, since most Windows and other application SW we use accept 
> both record formats, fortunately (we had issues with SW upgrades from 
> other vendors no longer accepting pure LF, but that only concerned a 
> very small number of scripts). With the new approach in Cygwin that 
> seems to be broken, so we have not upgraded Cygwin since then (we 
> currently use gawk 4.1.3).

Again, your attempt to freeze your system at some arbitrary point in time is misguided.  It'll never quite work out and chances are that when it breaks it will do so in ways that create more work and force you to do it in emergency mode, which is never a good thing.

> Of course the reason for that really annoying CR/LF thing is the 
> arrogance and ignorance of MS, which has cost innumerable wasted 
> developer hours when I think of the endless discussions and changes 
> in Cygwin; but MS is the one who defines the standards because of its 
> sheer market power, so we have to deal with it, whether we like it or not.

You really can't blame them for CRLF, they weren't and aren't the only ones using it and it's been in use long before Microsoft entered the scene.

> I'd definitely prefer to use Unix for its powerful tools, but most of 
> the SW we use is simply not available for Unix, and MS does not 
> provide gawk etc. So we have to deal with that CR/LF issue 
> pragmatically rather than, say, philosophically: we need 
> to run our scripts with as few changes as possible. That's why 
> we upgrade Cygwin as seldom as possible. It is a "living system", yes, 
> which is great on the one hand, but can be annoying in everyday 
> practice.

Again, you'd better figure out how to transform your input (and possibly output) so it'll conform to the conventions of the tool(s) you use, perhaps by providing a handful of wrapper scripts.  Alternatively, only use tools that adhere to the same set of conventions.

> In my opinion there should be at least an option for gawk to accept 
> both LF and CR/LF line endings equally, preferably with a system 
> variable so that there is no need to change the command line call of 
> gawk at all. That's what I vote for.

Yes, but please cast that vote with the upstream developers.  I reckon it'd be a generally useful function, so there's no point in providing it only on Cygwin.


Regards,
Achim.

Dear Achim,

I fully agree to most of what you say. But:

1) Just as Cygwin is a rolling distribution, my work is "rolling work". That is why I deal with it in what I call a pragmatic way: I need a working system with minimum maintenance effort. SW is as it is provided, and I have to adapt, since I mostly can't write my own.

2) When I started using Cygwin some 2 decades ago I was coming from Unix. C programming and awk were what I was used to. In fact, awk was my favourite tool, even for developing small C programs. When I was forced to switch to Windows I needed a feasible way to do text and data processing and to port several awk scripts to Windows within a short time. awk is a nearly perfect text processing tool to this day, though not widely known. I don't know anything comparable in terms of ease of syntax (once you know C), compactness of code and flexibility and, most important for me, I am very familiar with it. Somebody then recommended Cygwin, which I set up, learned to work with, and came to like. Decisions had to be made about scripting then, and for various reasons we ended up with cmd.exe and a couple of additional tools along with some major software. It was not at all ideal, but it was very easy and very flexible. Had we known how the system would one day grow, maybe we would have decided differently. Maybe. But I am not sure we would have come so far. We are a service provider with the need to automate tools, we are not a software vendor.

3) Years later somebody recommended GnuWin as native port, which I immediately tried. However, we ran into serious problems with quoting, as single quote syntax did not work there (Unix and Cygwin: gawk '{print}' would have to be written as gawk "{print}"), which broke a lot of scripts, and there were other problems with providing special characters, quoting, etc. which I could not manage to solve. So we did not switch (and, besides, sometimes I was not sure if GnuWin was still an active system - Cygwin has great user groups and is very active).

4) We have learned a lot about how to incorporate Cygwin in cmd.exe, even with constructs like
for /f "usebackq ..." %%A in (`someprog ... ^| gawk '{...}' ^| something`) do ...
and a lot of other and even more complicated things. That may sound strange, but it works and has worked for many, many years now. A lot is possible!

5) You can always find a better way to do things, of course, I won't argue about that. Sometimes we thought about switching to Java or php or python or whatever. Maybe we should. But we have a lot of running scripts, massive batch and parallel processing, and cmd.exe with minimum Cygwin (no X subsystem, no pile of tools, just a tiny installation) has worked great for many years - so why not use it? Just because it is not intended to be used that way?

6) We have a grown and growing system. To completely change the system would certainly mean months of development work. But we have no developer team. We work on projects which have to go on. We do programming where it is necessary in order to automate processes. Our clients don't care if we change software, they want results.

7) Yes, many things could be done much better. I'd like to have the perfect system. But there is no perfect system. Cygwin under cmd.exe works really fine once you have learned its specifics. In fact, Cygwin has done a really great job in our environment for nearly 2 decades so far, even if we mostly don't use it as intended!

There is one point where I disagree. You said,
> Again, you'd better figure out how to transform your input (and possibly
> output) so it'll conform to the conventions of the tool(s) you use, perhaps
> by providing a handful of wrapper scripts.  Alternatively, only use tools that
> adhere to the same set of conventions.
That is exactly what we do and have done so far, as I explained above. The problem comes up when developers decide, for whatever reason, to change the behaviour - which happened with the CRLF handling. You can argue that the previous CRLF handling of gawk was not POSIX conforming. To be honest, I never looked up the POSIX specifications. I use the SW by trying how it works and adapt to it. A SW vendor may be forced to check compatibility considerations before writing one single line of code (I doubt many of them do so). But I am not a SW vendor. If need be, I take the gawk manual, write code and test it. I realise there is that CRLF thing and adapt my scripts accordingly. For many years that worked; the developers did not change the behaviour. Our "input and output perfectly conformed to the conventions" (which means for me, what the SW accepted). Some day they changed the conventions. The reasons are comprehensible, of course, yet it causes a great deal of trouble. That is where we are now. So I "adhere to the same set of conventions" by simply not updating now.

Maybe in 10 years' time another developer group will decide to change it another way for some other good reason. Every change in syntax will cause problems. If a SW tool allows several ways of doing something, it can be assumed that all of them will be used by different people. If that behaviour is changed it *will* cause problems for some.


Anyway, thanks for the suggestion to contact the upstream developers. I was not aware of that. Can you give me a hint where to go?

Kind regards,
Wolfgang


* gawk 4.1.4: CR separate char for CRLF files
From: Jannick @ 2017-08-08 23:16 UTC
  To: cygwin

Dear All,

the current version 4.1.4 of gawk appears to handle the CR in CRLF files
unpleasantly, i.e. the CR is not gracefully swallowed, but is treated as a
separate character.

This makes some, if not all, of the scripts we are working with here
useless, unless the input files are converted to LF, which certainly is not
feasible. IIRC the issue did not show up some versions back.

Is this a bug - or am I missing something here?

Thanks,
J. - living on Win10




Thread overview: 26+ messages
2017-08-14 10:36 gawk 4.1.4: CR separate char for CRLF files Vermessung AVT - Wolfgang Rieger
2017-08-15 15:41 ` Jannick
2017-08-15 16:43 ` Achim Gratz
  -- strict thread matches above, loose matches on Subject: below --
2017-08-16 12:50 cyg Simple
2017-08-16 12:19 Vermessung AVT - Wolfgang Rieger
2017-08-16 12:09 Vermessung AVT - Wolfgang Rieger
2017-08-16 12:26 ` Eric Blake
2017-08-08 23:16 Jannick
2017-08-08 23:23 ` Steven Penny
2017-08-09  0:49   ` Jannick
2017-08-09  7:03     ` AW: " Roger Krebs
2017-08-09  8:38       ` Jannick
2017-08-09 11:03         ` Eric Blake
2017-08-09 19:09           ` Eric Blake
2017-08-10 12:04             ` cyg Simple
2017-08-10 12:31               ` David Macek
2017-08-10 14:46                 ` cyg Simple
2017-08-10 18:35                   ` Steven Penny
2017-08-10 21:34                     ` Brian Inglis
2017-08-10 21:49                       ` cyg Simple
2017-08-10 22:49                         ` Brian Inglis
2017-08-11 12:47                           ` cyg Simple
2017-08-11 16:54                             ` Brian Inglis
2017-08-11 17:06                               ` cyg Simple
2017-08-10 22:22                       ` Steven Penny
2017-08-10 22:49                         ` Brian Inglis
2017-08-10 23:59                           ` Steven Penny
