Date: Thu, 19 Jul 2012 11:40:00 -0000
From: Corinna Vinschen
To: cygwin@cygwin.com
Subject: Re: length in gawk returns wrong value
Message-ID: <20120719113927.GH31055@calimero.vinschen.de>
Reply-To: cygwin@cygwin.com
References: <20120719092024.GA31055@calimero.vinschen.de>

On Jul 19 11:27, Ralf wrote:
> Corinna Vinschen cygwin.com> writes:
>
> >
> > Uh oh.  1.7.9 is old.  Please update.
> >
> > > 0000000   R 374   c   k   e   n  \r  \n
> > > 0000010
> > > Length: 1
> > >
> > > What can I do to get the correct length in gawk without changing
> > > ttt.txt?
> >
> > Dunno.  This is not what I see.  What did you have $LANG and $LC_CTYPE
> > set to?
> > Here's what I see:
> >
> >   $ uname -a
> >   CYGWIN_NT-6.1 vmbert7 1.7.16(0.261/5/3) 2012-07-09 14:51 i686 Cygwin
> >
> >   $ echo $LANG
> >   C.UTF-8
> >
> >   $ echo "Rücken" > ttt.txt
> >   $ od -c ttt.txt
> >   0000000   R 303 274   c   k   e   n  \n
> >   0000010
> >
> >   $ gawk '{print "Length: " length($0)}' ttt.txt
> >   Length: 6
> >
> >   $ gawk --version | head -1
> >   GNU Awk 4.0.1
> >
> > Corinna
> >
>
> After updating I added following lines on top of my script:
> export LANG=C.UTF-8
> echo LANG: $LANG
> echo LC_CTYPE: $LC_TYPE
> c:/unix/bin/gawk --version | head -1
>
> And this is my output:
> LANG: C.UTF-8
> LC_CTYPE:
> GNU Awk 4.0.1
> CYGWIN_NT-6.0-WOW64 WIESWEG 1.7.15(0.260/5/3) 2012-05-09 10:25 i686 Cygwin
> 0000000   R 374   c   k   e   n  \r  \n
> 0000010
> Length: 5
>
> Very strange!

Not at all.  The file contains an invalid character.  0374 is the umlaut-u
in the ISO-8859-1 or ISO-8859-15 codesets.  Try this:

  $ LC_ALL=de_DE gawk '{print "Length: " length($0)}' ttt.txt
  Length: 6

When you create the file under the UTF-8 codeset, you'll get:

  0000000   R 303 274   c   k   e   n  \n


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
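
The codeset point in the reply above can be checked outside gawk. What follows is only a small Python sketch, not anything gawk itself does internally. It takes the bytes from the od dump in the report, shows that they do not decode as UTF-8, and shows that the same bytes read as ISO-8859-1 give the six-character word. The variable names are made up for the illustration.

```python
# The 8 bytes from the od dump in the quoted report: R 374 c k e n \r \n
latin1_bytes = b"R\374cken\r\n"

# Strict UTF-8 decoding fails: byte 0374 (0xFC) can never appear in UTF-8.
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Read as ISO-8859-1, the same bytes are the word "Rücken" plus the CRLF,
# which is stripped here so that only the word itself is counted.
word = latin1_bytes.decode("iso-8859-1").rstrip("\r\n")

# Re-encoding the word as UTF-8 turns the umlaut into the bytes 303 274,
# matching the od dump of the file created under C.UTF-8.
utf8_bytes = word.encode("utf-8")

print(valid_utf8)                      # False
print(len(word))                       # 6
print([oct(b) for b in utf8_bytes])
```

So the data is fine as ISO-8859-1 and broken as UTF-8, which is why the locale used to read the file has to match the codeset it was written in.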