Convert UTF16 file to UTF8 via command line


12 replies to this topic

#1 Zharif

Zharif

    Frequent Member

  • .script developer
  • 172 posts
  • Location:Germany
  •  
    Germany

Posted 22 November 2015 - 08:20 PM

This is a batch file question regarding the usage of sed v407 from the UnxUtils.

 

I have some files in some directories whose names contain unicode chars.

The goal is to create a renaming script that performs some substitutions on the filenames.

As usual, Dir /B /S /A:-D "DirPath"|sed "DoSomething" >"OutFile" produces an ansi file.

Unicode file or directory names cannot be displayed and become "?".

 

This one works to produce a UTF16LE file containing dirs/files with unicode chars:

@ECHO OFF

:: change codepage

chcp 1252>Nul

:: create a file with a byte order mark [UTF16LE]

ECHO ÿþ>"Dir16LE.txt"

:: cmd.exe /Unicode /ExecuteCmd'sAndExit; redirect dir results to 'Dir16LE.txt'

CMD /U /C DIR /B /S /A:-D "DirPath" >> "Dir16LE.txt"

ECHO End Of File&PAUSE>NUL&EXIT

 

Unfortunately sed seems not to be able to process a UTF16/32 file without hardcore coding.

Processing a UTF8 file is possible.

I need to produce a UTF8 file containing dirs/files with unicode chars.

 

This does not work:

@ECHO OFF

chcp 1252>Nul

:: create byte order mark to point to a UTF8 file explicitly

ECHO ï»¿>"Dir8.txt"

CMD /U /C DIR /B /S /A:-D "DirPath" >> "Dir8.txt"

:: also DIR /B /S /A:-D "DirPath" >> "Dir8.txt" does not work

ECHO End Of File&PAUSE>NUL&EXIT
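
A possible explanation for the failure, sketched in Python rather than batch: a BOM only labels the bytes that follow, it does not transcode them. The sketch assumes that ECHO writes the three BOM bytes plus CRLF and that CMD /U then appends UTF-16LE text; the function name and the sample path are made up for illustration.

```python
import codecs

def fake_dir8(names):
    """Mimic the batch above: UTF-8 BOM + CRLF (from ECHO), then UTF-16LE text (from CMD /U)."""
    header = codecs.BOM_UTF8 + b"\r\n"
    body = "".join(name + "\r\n" for name in names).encode("utf-16-le")
    return header + body

# Hypothetical file name containing non-ASCII characters
data = fake_dir8(["D:\\Música\\ñandú.txt"])

# An editor that trusts the UTF-8 BOM decodes everything after it as UTF-8
text = data.decode("utf-8", errors="replace")
print("NUL characters:", text.count("\x00"))
print("replacement characters:", text.count("\ufffd"))
```

Every ASCII character arrives with a trailing NUL byte, and the non-ASCII ones are not valid UTF-8 at all, which would explain why the result is not a usable UTF8 file.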

 

I also tried suggestions from Rob van der Woude and created some "header files" instead of echoing the BOMs.

http://www.robvanderwoude.com/type.php

No luck so far.

Does anybody know a solution?

 

Thanks in advance for any reply



#2 pscEx

pscEx

    Platinum Member

  • Team Reboot
  • 12707 posts
  • Location:Korschenbroich, Germany
  • Interests:What somebody else cannot do.
  •  
    European Union

Posted 23 November 2015 - 09:14 AM

TYPE {utf16} > {utf8}

 

Used (and working) in multiPE:

ShellExecute,HIDE,cmd.exe,"/C type #$q%TheInf%#$q > #$q%TheInfSet%#$q"

Peter



#3 Zharif

Posted 23 November 2015 - 10:18 AM

Thanks for the input, Peter.

 

I tried:

cmd /C TYPE "Dir16LE.txt" > "Dir8.txt"   <-- results in an ansi file

cmd /U /C TYPE "Dir16LE.txt" > "Dir8.txt"   <-- results in a binary file.

BTW: my text editor is PSPad



#4 Zharif

Posted 23 November 2015 - 10:27 AM

Hmm,

very interesting.

 

Indeed, your code works for a file with .inf extension.

I simply took a file from the system32 dir of my system (win8.1) to test.

CMD /C TYPE "tpm.inf" > "tpm8.inf" <-- result is a UTF8 file

 

Maybe the filename extension matters.

Will check this - thanks.



#5 Zharif

Posted 23 November 2015 - 10:53 AM

Next attempt:

@ECHO OFF

chcp 1252

:: copy "ÿþ" using a header file

COPY "U16LE.Head" "Dir16LE.inf" >NUL 2>&1

CMD /U /C DIR /B /S /A:-D "D:\" >> "Dir16LE.inf"

TYPE "Dir16LE.inf" > "Dir8.inf"

 

Result is an ansi file :(

 

The inf files shipped with Windows seem to differ somehow from the one generated by the batch file.



#6 pscEx

Posted 23 November 2015 - 11:13 AM

Maybe I misunderstood your first post.

 

IMO UTF8 was equivalent to ansi.

 

Post the first bytes of the UTF16, and post the bytes how they should be in the result.

 

BTW: IMO "ÿþ" is the flag for UNICODE in UTF16.

 

Peter



#7 Zharif

Posted 23 November 2015 - 11:35 AM

My batch file example first creates an empty UTF16LE file (Dir16LE.inf/txt) that contains only a BOM

("ÿþ" = UTF16LE; "þÿ" = UTF16BE; "ï»¿" = UTF8).

 

Reading this file with PSPadHex:

the first two bytes are FF FE (equivalent to the ansi chars "ÿþ")

The file "apm.inf" from my system does not seem to differ. Its first two bytes are also FF FE.

 

In the desired file the first three bytes should be EF BB BF (equivalent to the ansi chars "ï»¿")

But as far as I have read, this is not mandatory.

Converting apm.inf to a UTF8 file the way you described results in a UTF8 file without a BOM.
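
The byte check done by hand in the hex editor can also be scripted. A minimal Python sketch, assuming only the three BOMs discussed here matter; `detect_bom` is a made-up helper name, not part of any tool mentioned in this thread.

```python
import codecs

# BOM byte sequences discussed above, paired with a readable label
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
]

def detect_bom(data: bytes):
    """Return the label of the BOM at the start of data, or None if there is none."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None

print(detect_bom(b"\xff\xfeA\x00"))
```

A return value of None only means that no BOM is present; the file may still be ANSI or UTF8 without a BOM, which this check cannot tell apart. A UTF-32LE file also starts with FF FE and would be reported as UTF-16LE here.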



#8 pscEx

Posted 23 November 2015 - 11:43 AM

Converting apm.inf to a UTF8 file the way you described results in a UTF8 file without a BOM.

 

Sorry, that's what I need in my multiPE script. I never cared about the BOM.

 

Peter



#9 pscEx

Posted 23 November 2015 - 12:05 PM

ECHO ï»¿ > ??? appends CRLF.

 

This batch creates an ansi file with the UTF8 BOM:

 

@ECHO OFF
Copy "C:\Scratch\BOM8" "C:\Scratch\Copy.txt"
Type C:\scratch\xxx.reg >> "C:\Scratch\Copy.txt"
 

Peter



#10 Zharif

Posted 17 December 2015 - 06:27 PM

Peter,
sorry for absence.
Some of my commandline results (ACP 1252, OEMCP 850):

These work to create empty UTF files with a BOM (credits to Carlos M.):

:: empty UTF8 (BOM)
CHCP 1252 >NUL
CMD.EXE /D /A /C (SET/P=ï»¿)<NUL > "%CD%\UTF8Bom.txt" 2>NUL

:: empty UTF16LE (BOM)
CHCP 1252 >NUL
CMD.EXE /D /A /C (SET/P=ÿþ)<NUL > "%CD%\UTF16LE.txt" 2>NUL

:: empty UTF16BE (BOM)
CHCP 1252 >NUL
CMD.EXE /D /A /C (SET/P=þÿ)<NUL > "%CD%\UTF16BE.txt" 2>NUL

This works to create a UTF16LE file and to append the output of a unicode dir command to it:

CHCP 1252 >NUL
CMD.EXE /D /A /C (SET/P=ÿþ)<NUL > "%CD%\Dir16LE.txt" 2>NUL
CMD.EXE /D /U /C DIR /B /S /A:-D "D:\" >> "Dir16LE.txt"

Redirecting/appending only works for UTF16LE files and always adds a Windows line ending (CRLF).
It is not possible to redirect/append "anything" to a UTF8/UTF8BOM or UTF16BE file without altering the character set
(see the command sequence above). I gained experience in swapping and changing hex values.
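
Outside of cmd.exe the append step is less restrictive. A minimal Python sketch that appends a line in whatever encoding the BOM of the target file announces; `append_line` is a made-up name, and treating a file without a BOM as code page 1252 is an assumption taken from the ACP mentioned above.

```python
import codecs

# (BOM bytes, Python codec for the text that follows the BOM)
_BOM_CODECS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def append_line(path, line, fallback="cp1252"):
    """Append line plus CRLF to path, encoded to match the BOM already in the file."""
    with open(path, "rb") as f:
        head = f.read(3)
    codec = fallback  # assumption: no BOM means an ANSI (code page 1252) file
    for bom, name in _BOM_CODECS:
        if head.startswith(bom):
            codec = name
            break
    with open(path, "ab") as f:
        f.write((line + "\r\n").encode(codec))
    return codec
```

For a file without a BOM, a character that does not exist in the fallback code page raises a UnicodeEncodeError instead of silently turning into "?".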

 

In the meantime I found some very good code on cyberactive.com and wrote a command-line utility (VB6) that fits my needs.
My utility detects the file encoding of a given SourceFile (ANSI, UTF8-NoBOM, UTF8-BOM, UTF16LE-BOM, UTF16BE-BOM).
Converting a given SourceFile as well as appending to a file is supported (the latter with some limitations).

It's also possible to create empty files with one of the UTF encodings above.
Maybe it's of use for some others.

 

Some examples:

zUTFConv /SourceFile

- detects the file encoding

 

zUTFConv /SourceFile /DestFile /UTF8B

- Read text and detect file encoding of SourceFile, write text to DestFile using UTF8-BOM encoding

- If DestFile is omitted, SourceFile is changed in place (SourceFile will be overwritten)

 

zUTFConv /SourceFile /DestFile /UTF8B /WLF

- Read text and detect file encoding of SourceFile, write text to DestFile using UTF8-BOM encoding

- Ensure that DestFile uses Windows line endings (CRLF)

- If DestFile is omitted, SourceFile is changed in place (SourceFile will be overwritten)

 

zUTFConv /SourceFile /DestFile /A

- Read text and detect file encoding of Source and DestFile

- If encodings of both files are identical, append text of SourceFile to DestFile

- If DestFile is omitted, the text of SourceFile is duplicated

 

Please, any comments and suggestions are welcome.

 

zUTFConv Help:




#11 Icecube

Icecube

    Gold Member

  • Team Reboot
  • 1063 posts
  •  
    Belgium

Posted 21 December 2015 - 09:29 PM

iconv should also be able to convert text files in various encodings to another one:

iconv -f UTF-16LE -t UTF-8 -o output.txt input.txt
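
If neither Cygwin nor another iconv build is at hand, roughly the same conversion can be sketched with Python's built-in codecs. `transcode` and its parameters are made-up names; this illustrates the idea and is not a replacement for iconv.

```python
def transcode(src, dst, from_enc="utf-16-le", to_enc="utf-8", chunk_chars=65536):
    """Re-encode the text file src into dst, reading in chunks of characters."""
    # newline="" switches off newline translation, so CRLF stays CRLF
    with open(src, "r", encoding=from_enc, newline="") as fin, \
         open(dst, "w", encoding=to_enc, newline="") as fout:
        while True:
            chunk = fin.read(chunk_chars)
            if not chunk:
                break
            fout.write(chunk)
```

With from_enc="utf-16-le" Python treats a leading FF FE as an ordinary character, so the output starts with the UTF8 BOM (EF BB BF). Passing from_enc="utf-16" instead consumes the BOM and yields UTF8 without one.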

You can find iconv and sed in Cygwin

 

There also seems to be a Windows version of iconv: https://github.com/win-iconv/win-iconv



#12 Zharif

Posted 22 December 2015 - 10:31 PM

Thanks Icecube for your contribution.

Let's add that iconv (as well as sed) is also part of the GnuWin32 utilities package

(package batch installer and updater by michaelis).

http://sourceforge.n...gnuwin32/files/

 

Although I have used this package (as well as the UnxUtils) for a very long time, I never noticed this tool.

Thanks for pointing this out. Will try it as soon as possible.

 

But first I want to work on my own program.

So far it only supports internal conversion of the text of a given SourceFile to a DestFile.

My main problem is not solved yet, because piped input of a text stream is still not supported.

I'm a really experienced user of sed and use it very often to alter text files of different kinds.

All the more, the broken redirection of stdout to a file when unicode chars are involved is bugging me.

First I need to understand what is happening behind the scenes

(which will surely mean a very slow learning curve).
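
For the piped-input case, one conceivable approach is a small filter that sits between the unicode dir output and sed. A minimal Python sketch, assuming the incoming stream is UTF-16LE as produced by CMD /U; the function name is hypothetical.

```python
import codecs

def utf16le_stream_to_utf8(src, dst, chunk_size=4096):
    """Read UTF-16LE bytes from src and write the same text to dst as UTF-8 (no BOM)."""
    decoder = codecs.getincrementaldecoder("utf-16-le")()
    at_start = True
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # The incremental decoder keeps half of a split character until the next chunk
        text = decoder.decode(chunk)
        if at_start and text:
            if text.startswith("\ufeff"):
                text = text[1:]  # drop a leading BOM if there is one
            at_start = False
        dst.write(text.encode("utf-8"))
    # Flush the decoder; raises UnicodeDecodeError if the stream ended mid-character
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))
```

Calling `utf16le_stream_to_utf8(sys.stdin.buffer, sys.stdout.buffer)` at the end of a script (after `import sys`) would turn it into a filter, so that something like CMD /U /C DIR ... | python filter.py | sed ... becomes possible. Whether the console pipe passes the bytes through unchanged on a given Windows version is something to verify.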

Thanks so far...



#13 Icecube

Posted 24 December 2015 - 02:58 PM

It might be worth installing Cygwin to get a POSIX environment instead of running the (quite old) Windows ports of these programs under cmd.

 

From a quick Google search I found the following at http://superuser.com...e-locale-issue:

It's starting to seem that the issue may not lie with SED.exe itself, but in the way that Windows doesn't handle code-pages very well in its cmd.exe console. Maybe it works in its PowerShell, but if I have to go there, I'd rather focus on Python instead. As far as I can see, Windows own pride and joy, UTF-16 (code-page 1200, msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx) is available only to managed applications, whatever that means, but it surely doesn't work in the console.

 

 

This Cygwin page might shed some light on the issue:

https://cygwin.com/c...tup-locale.html





