From: Csaba Raduly
To: cygwin@cygwin.com
Subject: Re: length in gawk returns wrong value
Date: Thu, 19 Jul 2012 12:36:00 -0000
In-Reply-To: <20120719113927.GH31055@calimero.vinschen.de>
References: <20120719092024.GA31055@calimero.vinschen.de> <20120719113927.GH31055@calimero.vinschen.de>

On Thu, Jul 19, 2012 at 1:39 PM, Corinna Vinschen wrote:
> On Jul 19 11:27, Ralf wrote:
>> Corinna Vinschen writes:
>>
>> >
>> > Uh oh. 1.7.9 is old. Please update.
>> >
>> > > 0000000 R 374 c k e n \r \n
>> > > 0000010
>> > > Length: 1
>> > >
>> > > What can I do to get the correct length in gawk without changing
>> > > ttt.txt?
>> >
>> > Dunno. This is not what I see. What did you have $LANG and $LC_CTYPE
>> > set to? Here's what I see:
>> >
>> > $ uname -a
>> > CYGWIN_NT-6.1 vmbert7 1.7.16(0.261/5/3) 2012-07-09 14:51 i686 Cygwin
>> >
>> > $ echo $LANG
>> > C.UTF-8
>> >
>> > $ echo "Rücken" > ttt.txt
>> > $ od -c ttt.txt
>> > 0000000 R 303 274 c k e n \n
>> > 0000010
>> >
>> > $ gawk '{print "Length: " length($0)}' ttt.txt
>> > Length: 6
>> >
>> > $ gawk --version | head -1
>> > GNU Awk 4.0.1
>> >
>> > Corinna
>> >
>>
>> After updating I added following lines on top of my script:
>> export LANG=C.UTF-8
>> echo LANG: $LANG
>> echo LC_CTYPE: $LC_TYPE
>> c:/unix/bin/gawk --version | head -1
>>
>> And this is my output:
>> LANG: C.UTF-8
>> LC_CTYPE:
>> GNU Awk 4.0.1
>> CYGWIN_NT-6.0-WOW64 WIESWEG 1.7.15(0.260/5/3) 2012-05-09 10:25 i686 Cygwin
>> 0000000 R 374 c k e n \r \n
>> 0000010
>> Length: 5
>>
>> Very strange!
>
> Not at all. The file contains an invalid character. 0374 is the
> umlaut-u in the ISO-8859-1 or ISO-8859-15 codesets. Try this:
>
> $ LC_ALL=de_DE gawk '{print "Length: " length($0)}' ttt.txt
> Length: 6
>
> When you create the file under the UTF-8 codeset, you'll get:
>
> 0000000 R 303 274 c k e n \n
>

Proving, once again, that "There Ain't No Such Thing as Plain Text"
http://www.joelonsoftware.com/articles/Unicode.html

Csaba
--
GCS a+ e++ d- C++ ULS$ L+$ !E- W++ P+++$ w++$ tv+ b++ DI D++ 5++
The Tao of math: The numbers you can count are not the real numbers.
Life is complex, with real and imaginary parts.
"Ok, it boots. Which means it must be bug-free and perfect. " -- Linus Torvalds
"People disagree with me. I just ignore them." -- Linus Torvalds

--
Problem reports: http://cygwin.com/problems.html
FAQ: http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
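
For anyone who wants to poke at the byte-level point made in this thread, here is a minimal Python sketch. It only compares the two byte sequences shown by od -c (with the line endings left off) under the two codesets mentioned; the names LATIN1_FILE, UTF8_FILE and char_count are made up for illustration. It does not reproduce how gawk itself counts: Python's strict decoder rejects the stray 0374 byte outright, whereas gawk printed a length of 5 for it above.

```python
# Bytes as shown by `od -c` in the thread, written with octal escapes.
# Line endings are omitted so only the word itself is measured.
LATIN1_FILE = b"R\374cken"      # 0374 (0xFC) is u-umlaut in ISO-8859-1
UTF8_FILE = b"R\303\274cken"    # 0303 0274 (0xC3 0xBC) is u-umlaut in UTF-8


def char_count(raw, codec):
    """Return the number of characters in raw under codec.

    Returns None when raw is not valid in that codec (hypothetical helper,
    not something gawk or Cygwin provides).
    """
    try:
        return len(raw.decode(codec))
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    for name, raw in (("latin1 bytes", LATIN1_FILE), ("utf-8 bytes", UTF8_FILE)):
        for codec in ("iso-8859-1", "utf-8"):
            # Columns: which sample, which codec, byte count, character count
            print(name, codec, len(raw), char_count(raw, codec))
```

Read with the codeset it was written in, either sample is the same six-character word; read with the wrong one, the ISO-8859-1 bytes are simply not valid UTF-8, and the UTF-8 bytes come out as seven ISO-8859-1 characters.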