
Small Guide to edit UTF8 files via batch

file encoding utf8 bom windows command line windows shell script


#1 Zharif

Posted 27 May 2020 - 03:46 PM

For the last couple of years I have had to deal with UTF8 files produced by SQL database parsers, especially for IBM SPSS Statistics. When trying to write, copy or merge such files via Windows shell scripts, I stumbled upon some weird behaviour of the Windows command processor.

As suggested, I will try to share my experience and results, hoping that they might be useful for some of you.
At this point I would also like to thank Wonko for the contributions.
Please note that the descriptions and explanations below are intended for practical use only.
Providing deeper background information would go beyond the scope of this thread.

I want to work with UTF8 files, but doing this with a Windows shell script is not an easy task. Moreover, the risk of file corruption has to be considered when dealing with such files. Therefore, some basic background information is necessary to understand what I'm talking about.

Let's start with some (really simplified) information about codepages (= encoding of characters in a charset table).
Please distinguish between two codepages of interest: the default OEM codepage (OEMCP) and the default ANSI codepage (ACP).
Both are used by Windows for non-Unicode programs.
There's also the LCID (locale ID). In my experience it is still often found in Windows *.inf files.

Registry entries:

HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP    =  OEM codepage
HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP     =  ANSI codepage
HKLM\SYSTEM\CurrentControlSet\Control\Nls\Language\Default  =  LCID

Powershell equivalents:

Get-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage|Select-Object OEMCP, ACP
Get-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\Language|Select-Object Default
[System.Text.Encoding]::Default

Note that these may be overridden by user-defined settings.


Every char has its specific position and hex representation in a mapped charset table.
In OEMCPs (non-Unicode, for the console) and ACPs (non-Unicode, for the GUI), chars are mapped as single bytes (at least on Western European systems). If a char has the same position in both codepages, you can display (or save) this content without any concerns.
Display or saving issues (not only in the console) may occur if the currently mapped codepage does not match the encoding in which a text or file was created/saved.

As an example, the default ACP for a German Windows is 1252 (Western Europe).
But to this day, the OEMCP inside the console is 850.
The positions of German umlauts and accents differ between both charsets.
That's why we have to switch to cp 1252 inside the console window or a batch to display these from a file correctly.
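
To make the "different positions" concrete, here is a small Python sketch (Python only for illustration, it is not part of the batch solutions). It assumes that Python's codecs "cp1252" and "cp850" match the Windows ACP and OEMCP of a German system; the helper name byte_positions is made up for this example.

```python
# Compare the byte value of each German special char in cp 1252 and cp 850.
def byte_positions(text):
    """Return {char: (cp1252 byte, cp850 byte)} as two-digit hex strings."""
    table = {}
    for ch in text:
        ansi = ch.encode("cp1252")   # byte a GUI (ANSI) program would store
        oem = ch.encode("cp850")     # byte the console (OEM) would store
        table[ch] = (ansi.hex().upper(), oem.hex().upper())
    return table

positions = byte_positions("ÄÖÜäöüß")
for ch, (ansi, oem) in positions.items():
    # Print code points instead of glyphs so the output is console-safe.
    print(f"U+{ord(ch):04X}: cp1252={ansi} cp850={oem}")

# Bytes written under cp 1252 but read with the cp 850 table show other glyphs.
misread = "ÄÖÜäöüß".encode("cp1252").decode("cp850")
print(ascii(misread))
```

None of the seven chars shares its byte value between the two codepages, which is why text saved by a GUI editor looks wrong in a console that still runs under cp 850.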

There's also the UTF8 codepage (cp 65001).

To my current knowledge it is available on nearly all Windows systems. In contrast to OEMCP and ACP, this codepage stores chars 0-127 (hex 00-7F) as single bytes but is also able to store all other Unicode chars as multi-byte sequences (two to four bytes).
Therefore, UTF8 files don't consume more disk space than ANSI files if the content consists of single byte chars only.
File size only increases by the amount of chars that are stored as multi-byte sequences. This is in contrast to UTF16 encoded files, where every char takes at least two bytes. There, file size is at least doubled compared to an ANSI file, regardless of the content.

Without trying to make things complicated, it is important to know that some single byte chars in OEMCP 850 and ACP 1252 are mapped as double byte chars in UTF8.
Examples for German systems are the umlauts (ÄÖÜäöüß) and the acute accent (´).
The grave accent (`) and the circumflex (^) still remain single byte.
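
The byte counts can be checked quickly with a few lines of Python (again only an illustration, not part of the batch solutions). The function name encoded_sizes and the sample labels are made up for this sketch.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character needs in UTF8 and UTF16LE."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16-le": len(ch.encode("utf-16-le")),
    }

# Sample characters taken from the template text of this thread.
samples = {
    "A": "A",
    "circumflex": "^",
    "grave": "`",
    "a-umlaut": "ä",
    "acute": "´",
    "chinese": "欢",
}

sizes = {name: encoded_sizes(ch) for name, ch in samples.items()}
for name, size in sizes.items():
    print(f"{name:11} utf-8={size['utf-8']} utf-16-le={size['utf-16-le']}")
```

Plain ASCII chars (including the grave accent and the circumflex) need one byte in UTF8, umlauts and the acute accent need two, and each of the Chinese chars needs three. In UTF16LE every one of these samples takes two bytes.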

Something about "character alterations":
Let's distinguish between mere display issues and a real change of characters.
The latter can be verified with a trustworthy hex editor.
Alterations may occur, e.g. in a text editor, while viewing a file or after writing to it.

1.    Halfway harmless cases (viewing issues only):
-       A UTF8 file with Unicode content has been opened with an inappropriate
        charset table (wrong codepage) that does not support multi-byte characters.
        The default OEMCP and ACP are such examples. As they express every char as a single byte,
        double byte chars in UTF8 files are usually shown as question marks in the editor.
        Don't save this file before closing. Saving converts it to ANSI.
        At the next opening all double byte chars are replaced by question marks and there's
        no way to restore the original text.
-       A UTF8 file with Unicode content has been opened in a codepage environment that
        supports double- or multi-byte chars, but their positions in the current
        codepage do not match the positions in the UTF8 charset.
        In these cases, "strange" glyphs are displayed instead of the expected chars.
        Don't save this file before closing. Saving changes the encoding to the current one.
        At the next opening the character alterations are still there and there's no way to
        restore the original text.

2.    Some really harmful cases after writing (these lead to permanent file corruption):
-       A UTF8 file (not opened) contains double byte chars and you try to append text
        to it via console or batch, or you try to merge the whole content of two files.
        If the codepage environment inside the console is not the same as the encoding of
        the target file, a real change of characters occurs that cannot be undone.
        Case 1:  The writing environment is OEMCP or ACP.
                 As long as you append single byte chars that are also expressed as
                 single bytes in the UTF8 charset (0-127, hex 00 to 7F) and whose
                 position in the current OEMCP or ACP charset equals the position in
                 the UTF8 charset, nothing special will happen.
                 At the next opening the target file is still UTF8;
                 text/chars are written and shown as expected.
                 In any other case, the UTF8 file will be converted back to ACP.
                 Opening the file after appending text shows its new encoding,
                 Unicode/double byte chars are replaced with question marks
                 and there's no way to recover the original content.
        Case 2:  The writing environment supports Unicode but does not match the UTF8 codepage.
                 To keep things as short as possible here: whatever you do, the result may
                 be a mix of both encodings, a partial translation of double- or multi-byte
                 chars, or a reversion to the default ACP.
                 The file may or may not remain encoded as UTF8 and there's no way to
                 recover the original content of the target file.
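
The difference between a pure viewing issue and a real change of characters can be reproduced in a few lines of Python. This is only a sketch of the byte-level effects, under the assumption that Python's "cp1252" codec stands in for an ANSI editor or console; it does not reproduce what a particular editor or cmd.exe actually does.

```python
original = "ÄÖÜ 欢迎"
utf8_bytes = original.encode("utf-8")

# Viewing issue only: the bytes are untouched, just read with the wrong table.
mojibake = utf8_bytes.decode("cp1252")
# Still recoverable, because no byte has been changed.
recovered = mojibake.encode("cp1252").decode("utf-8")

# Real alteration: saving the text as ANSI replaces what cp 1252 cannot hold.
saved_as_ansi = original.encode("cp1252", errors="replace")

# Appending a cp 1252 umlaut byte to UTF8 content leaves a broken UTF8 file.
mixed = utf8_bytes + "ä".encode("cp1252")
try:
    mixed.decode("utf-8")
    mixed_is_valid_utf8 = True
except UnicodeDecodeError:
    mixed_is_valid_utf8 = False

print(ascii(mojibake))
print(saved_as_ansi)
print(mixed_is_valid_utf8)
```

After the ANSI save the two Chinese chars are stored as literal question marks (hex 3F), so nothing is left to recover them from. The mixed file is no longer valid UTF8, and an editor can only guess how to interpret it.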

One last thing about the Byte Order Mark (BOM) of a UTF8 file.
A UTF8 BOM is an identifier sequence that may or may not be written as the first three bytes of a UTF8 file.

In hex it is EF BB BF; read as cp 1252 it shows up as the glyphs "ï»¿".
This byte sequence is not visible when opening/viewing the file as UTF8.
There is a lot of generalized information around about the need to add or remove a BOM for applications or programming languages to run correctly. In fact, cases exist where a UTF8 BOM breaks the execution of programs. However, this really depends on the specific operations inside an application or programming language.
Just a little example:
IBM SPSS Statistics is written in Java. Many web sources recommend avoiding a BOM in UTF8 files that are processed by Java (some do not). In contrast, SPSS users (like me) continuously experience that a BOM is mandatory for SPSS syntax files (a built-in scripting language) so that SPSS detects that Unicode chars are involved (versions 21-26).
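
For a quick look at the three BOM bytes, here is a short Python illustration (not part of the batch solutions). It only uses the standard codecs module; the variable names are made up for this sketch.

```python
import codecs

bom = codecs.BOM_UTF8                      # the three bytes EF BB BF
data = bom + "abc".encode("utf-8")         # content of a small UTF8 file with BOM

as_cp1252 = bom.decode("cp1252")           # what an ANSI viewer shows for the BOM
plain = data.decode("utf-8")               # keeps the BOM as the char U+FEFF
stripped = data.decode("utf-8-sig")        # "utf-8-sig" drops a leading BOM

print(bom.hex().upper())
print(ascii(as_cp1252), ascii(plain), ascii(stripped))
```

Whether a program wants the "utf-8" or the "utf-8-sig" behaviour is exactly the kind of application-specific detail mentioned above.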

Below, I have formulated some tasks for handling UTF8 files with a Windows shell script.

The solutions were extensively tested on Windows 10, also on Windows 7 and partially on Windows XP (SP3).
The latter was a great disappointment.
Although any of the solutions below can be executed by typing them into the console window,
XP breaks the processing of batch files if a solution requires switching to cp 65001 (UTF8).

All example solutions are (hopefully) well documented and written as stand-alone code blocks.
Therefore, they are somewhat verbose. Feel free to remove any unnecessary content.

All provided code is also part of the *.zip archive attached at the bottom of this post.

This archive also contains the required template files mentioned below.
_______________________________________________________________________________________________________________________

TASKS:

1.      How to detect the file encoding?
        - Is a given file encoded as UTF8 or as UTF8 with Byte Order Mark?

        And while we're at it, let's also detect if a file is encoded in
        - UTF16 Little or Big Endian (UTF16LE/UTF16BE) or
        - UTF32 Little or Big Endian (UTF32LE/UTF32BE).

2.      How to create an empty UTF8 file with Byte Order Mark?

3.      How to add the byte order mark to an already existing and non-empty UTF8 file without BOM?

4.      How to remove the byte order mark from an already existing and non-empty UTF8 file?

5.      How to merge two or more UTF8 files (= appending their contents to an already existing and non-empty target UTF8 file)?

6.      How to append specific single lines of text to a UTF8 file?
         Case 1:  by use of predefined environment variables or built-in command processor tools
         Case 2:  by user-defined text or variables

7.      Special case: How to inject text into a UTF8 file at specific line positions?
_______________________________________________________________________________________________________________________

PREREQUISITES:
Only a few of the solutions below can be done with Windows built-in tools alone.

1.  In most cases, the Unix stream editor SED v. 4.07 from UnxUtils (sed.exe) is required.
     The most important feature needed here is that it supports writing the hex expression of
     a character and the use of character classes/ranges in hex.
     This version has no dependencies.
     The suggested solutions in this thread refer to this version, but any GNU SED should work as
     long as it is version 4.07 or above. However, this has NOT been tested.

2.  Some template files are needed.
     To simplify this process, three template files are attached in the *.zip archive at the end of this post.
     You could also copy and paste the contents of the code box below into a good text editor.
     Save this content as files with different encodings:
        once as UTF8 (UTF8.txt),
        once as UTF8BOM (UTF8BOM.txt) and
        once as UTF16LE (UTF16LE.txt).
     These will become the working files.

     Important:
     Reopen these files with your text editor and ensure that they're
     really encoded as UTF8, UTF8BOM and UTF16LE respectively.
     Furthermore, ensure that your text editor displays ALL special chars
     shown in the code box below correctly. If not, something went wrong
     and I suggest using the predefined files from the attachment.
     Never trust a text editor blindly regarding codepage settings.
     Sometimes they do silly, sometimes really unexpected things (silently) when closing a file.

3.  Special folder preparation is recommended.
     Create a folder, let's say "D:\Test Änvironment欢迎".
     This exemplary folder name (space, umlaut and Chinese Unicode chars) is important
     to ensure that the suggested solutions really work under the given circumstances.
     In fact, it makes a difference for solving task 6 if e.g. path names contain
     chars that may be mapped as single, double or multiple bytes.

4.  Put sed.exe, the three working files and an EMPTY batch file into this folder.

     (If you use the content of the *.zip archive, you only have to add sed.exe to it.)

That's it.

Template text content (please note the two empty lines at the end of this box):

This is line one.
These are two chinese chars:欢迎 :MultiByte: three pairs of DoublByte in UTF8, two pairs of TripleByte in UTF16LE
These are german umlauts:ÄÖÜäöüß :SingleByte in cp 1252 and DoubleByte in UTF8
This is an accent acute:´        :SingleByte in cp 1252 and DoubleByte in UTF8
This is an accent circumflex:^   :SingleByte in cp 1252 and UTF8
This is line six.


SED v.4.07 from UnxUtils can be downloaded from different locations.
Single download:http://www.wzw.tum.d...g/win32/sed.exe.
Package at SourceForge:http://unxutils.sour.../UnxUpdates.zip.
Alternate link:http://www.weihenste.../UnxUpdates.zip.
_______________________________________________________________________________________________________________________
 

SOLVING TASK 1:
How to detect the file encoding (also works under XP)?

The query below is needed several times in my examples and is therefore implemented as a CALL routine.
It requires the input file and fills the requested output variable with a value in the range 0-7.

This code is heavily commented. Feel free to remove comments for better readability.

Copy the code below into the empty batch file:

Spoiler
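
As a rough cross-check of what such a detection has to do, here is a minimal Python sketch. It is not the batch routine from the attachment: the function name detect_encoding and the returned labels are made up for this example, and it does not use the 0-7 value range of the batch CALL. It only looks at the BOM and, if there is none, tests whether the bytes are valid UTF8.

```python
import codecs

# Check the longest BOMs first: the UTF32LE BOM (FF FE 00 00) starts with
# the same two bytes as the UTF16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF32LE"),
    (codecs.BOM_UTF32_BE, "UTF32BE"),
    (codecs.BOM_UTF8, "UTF8BOM"),
    (codecs.BOM_UTF16_LE, "UTF16LE"),
    (codecs.BOM_UTF16_BE, "UTF16BE"),
]

def detect_encoding(data):
    """Guess the encoding of raw file bytes (labels are hypothetical)."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "ANSI"          # not valid UTF8, so some single byte codepage
    return "UTF8"              # note: pure ASCII content also lands here

print(detect_encoding(codecs.BOM_UTF8 + b"abc"))
print(detect_encoding("ä".encode("cp1252")))
```

To use it on a file, read the file in binary mode, e.g. open(path, "rb").read(), and pass the bytes to detect_encoding. A file without BOM that happens to contain only ASCII cannot be told apart from ANSI by this approach.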

 

 

SOLVING TASK 2:
How to create an empty UTF8 file with Byte Order Mark  (also works under XP)?

The solution below avoids typing the BOM glyphs into your text editor when creating a BOM.
The underlying reason (from experience) is that, depending on the codepage setting of
your text editor, your batch file may be (silently) converted to UTF8 when it is closed.
Running batches that were accidentally encoded as UTF8 really leads to unexpected
behaviour/results and may cause data loss if the batch file includes commands
such as RMDIR/RD, RENAME/REN, ERASE/DEL or COPY.
This example solution uses a CALL to create the UTF8 header file.

Copy the code below into the empty batch file:

Spoiler
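
For comparison only (this is not the batch code from the attachment), the same result looks like this in Python: the three BOM bytes are written in binary mode, so no glyph ever has to be typed into an editor. The function name and the file name UTF8BOM_header.txt are made up for this sketch.

```python
import codecs
import os
import tempfile

def create_empty_utf8_bom_file(path):
    """Create (or overwrite) a file that contains nothing but the UTF8 BOM."""
    with open(path, "wb") as handle:       # binary mode: no codepage involved
        handle.write(codecs.BOM_UTF8)

# Demo in a temporary folder so that no real data is touched.
demo_dir = tempfile.mkdtemp()
header_path = os.path.join(demo_dir, "UTF8BOM_header.txt")
create_empty_utf8_bom_file(header_path)

with open(header_path, "rb") as handle:
    header_bytes = handle.read()
print(header_bytes.hex().upper(), os.path.getsize(header_path))
```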

 

 

SOLVING TASK 3:
How to add the byte order mark to an already existing and non-empty UTF8 file without BOM (also works under XP)?

Copy the code below into the empty batch file (needs _GetFileEncoding):

Spoiler

 

 

SOLVING TASK 4:
How to remove the byte order mark from an existing and non-empty UTF8 file (also works under XP)?

Copy the code below into the empty batch file (needs _GetFileEncoding):

Spoiler
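
Tasks 3 and 4 boil down to prepending or cutting off three bytes. A minimal Python sketch of both (not the batch code from the attachment; add_bom and remove_bom are hypothetical names) that works on the raw bytes of a file:

```python
import codecs

def add_bom(data):
    """Prepend the UTF8 BOM unless the data already starts with it."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data

def remove_bom(data):
    """Cut off a leading UTF8 BOM; data without BOM is returned unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

content = "Zeile mit Umlaut: ä\r\n".encode("utf-8")
with_bom = add_bom(content)
without_bom = remove_bom(with_bom)
print(with_bom[:3].hex().upper(), without_bom == content)
```

Both functions check for an existing BOM first, so calling them twice does no harm. The rest of the content is passed through byte by byte, which means no codepage can interfere.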

 

 

SOLVING TASK 5:
How to merge two or more UTF8 files (= appending their contents to an already existing and non-empty target UTF8 file)?

Remarks:
-    On Windows XP, the provided solution only works when typed manually into the
     console window. It does not work inside a batch because of the codepage switch
     to cp 65001, which seems to break further batch file processing.
-    When merging UTF8 files, an existing or non-existing BOM does not matter.
-    Switching to cp 65001 (UTF8) before merging is mandatory.

Copy the code below into the empty batch file (needs _GetFileEncoding):

Spoiler
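
To show the idea at byte level, here is a small Python sketch of a merge (not the batch code from the attachment; merge_utf8 is a made-up name). As an assumption of this sketch, a BOM at the start of an appended file is dropped so that it does not end up in the middle of the target; the batch solution may handle this differently.

```python
import codecs

def merge_utf8(target, *sources):
    """Append the bytes of each source to target, dropping the sources' BOMs."""
    merged = target
    for source in sources:
        if source.startswith(codecs.BOM_UTF8):
            source = source[len(codecs.BOM_UTF8):]   # keep only the first BOM
        merged += source
    return merged

first = codecs.BOM_UTF8 + "Zeile 1: ÄÖÜ\r\n".encode("utf-8")
second = codecs.BOM_UTF8 + "Zeile 2: 欢迎\r\n".encode("utf-8")
third = "Zeile 3\r\n".encode("utf-8")

merged = merge_utf8(first, second, third)
merged_lines = merged.decode("utf-8-sig").splitlines()
print(len(merged_lines), merged.count(codecs.BOM_UTF8))
```

Because only bytes are concatenated, no conversion between codepages takes place. If the target does not end with a line break, the first appended line continues the last target line.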

 

 

SOLVING TASK 6:
This is a somewhat tricky business.
At least, the provided solutions work for systems on which ACP 1252 (Western Europe)
is installed. For Eastern language systems and/or systems with multi-byte support,
the provided steps may have to be adjusted.

Remarks:
-    In fact, it makes a difference whether you try to append predefined system variables
     and built-in commands of the command processor to a UTF8 file, or whether you try to
     append user-defined variables or text.
-    On Windows XP, the solutions provided here only work when typed manually into
     the console window. They do not work inside a batch because of the codepage switch
     to cp 65001, which seems to break further batch file processing.
-    An existing or non-existing BOM does not matter for these solutions.

Case 1:
How to append a single line of text to a UTF8 file by use of predefined
environment variables or built-in command processor tools?


Although you will never be able to display the Chinese Unicode chars correctly inside
the console, as long as you switch to codepage 65001 the command processor does
the correct Unicode (UTF8) conversion for you when appending text to a UTF8 file.

In this example the location of the batch file is appended to the target UTF8
file via built-in commands. These are "%~dp0", "CD" and "DIR".

I also want to know whether a file path location is written successfully
if the underlying path itself contains Unicode characters.
That's why I suggested creating such a folder structure as a prerequisite.

Copy the code below into the empty batch file (needs _GetFileEncoding):

Spoiler

 

 

Case 2:
How to append a single line of text in a UTF8 file by user defined text or variables?

 

Two different scenarios are common:

Scenario 1:
When appending a single byte character or text sequence in which the position of each
char is the same in OEMCP 850, ACP 1252 and UTF8, nothing special happens.
Just switch to cp 65001 and append the text sequence.
Without switching to cp 65001, the target file will be converted to ANSI.

The example below adds "AOUaou", immediately followed
by a grave accent and a circumflex.

Copy the code below into the empty batch file (needs _GetFileEncoding):

Spoiler

 

 

Scenario 2:
The previous solution definitely does not work if a single byte char
is involved that is expressed as a double byte char in the UTF8 charset.
It also does not work if its position in OEMCP 850 differs from its position in the
system's default ACP (cp 1252 for a German Windows).
Examples are German umlauts or an acute accent.

The provided solution looks somewhat weird.
The double byte text is stored as a variable in the cp 1252 environment.
Then, after switching to cp 65001, this variable is appended to the target UTF8 file.

The steps above MUST be done exactly in the order provided.
Omitting or reordering any of these steps will result in file corruption.

Any text stored as a variable that you're trying to append must be created under the
cp 1252 environment. Variables created before switching to cp 1252 must be re-initialized.

Remarks:
The first important step of this solution (switching to ACP 1252) seems to depend on the system's default ACP.
I cannot test if and how this has to be adjusted for non-European default ACPs.

Copy the code below into the empty batch file:

Spoiler
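
What both scenarios have to achieve in the end can be sketched in Python (only as an illustration, it does not reproduce the batch steps or the codepage switching of cmd.exe). Text that exists as cp 1252 bytes has to be re-encoded as UTF8 before it is appended. The function names and the temporary target file are made up for this example.

```python
import os
import tempfile

def ansi_to_utf8(raw, codepage="cp1252"):
    """Re-encode bytes that were created under an ANSI codepage as UTF8."""
    return raw.decode(codepage).encode("utf-8")

def append_line_utf8(path, text, newline="\r\n"):
    """Append one line to a file as UTF8 bytes (binary mode, no codepage)."""
    with open(path, "ab") as handle:
        handle.write((text + newline).encode("utf-8"))

# Temporary target file so that no real data is touched.
target = os.path.join(tempfile.mkdtemp(), "UTF8.txt")
with open(target, "wb") as handle:
    handle.write("This is line one.\r\n".encode("utf-8"))

append_line_utf8(target, "D:\\Test Änvironment欢迎\\")   # a path with Unicode chars
append_line_utf8(target, "ÄÖÜäöüß´")                     # umlauts and acute accent

with open(target, "rb") as handle:
    result_lines = handle.read().decode("utf-8").splitlines()
print(len(result_lines), ansi_to_utf8(b"\xe4").hex().upper())
```

The single cp 1252 byte E4 (ä) becomes the two UTF8 bytes C3 A4. This is the conversion that the strict order of steps in the batch solution is there to get right.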

 

 

SOLVING TASK 7:
How to inject a UTF8 file at a specific line position for substitution?

This can be done with SED.
For those who are somewhat familiar with regular expressions (REs) or have maybe
used Windows' FINDSTR in the past, the provided syntax is not difficult
to understand. For others, it's beyond the scope of this thread to describe
the command line syntax of SED in detail.

Here's some really simplified information:
SED processes input streams line by line.
Backslashes in search or replacement patterns (e.g. Windows paths) must be escaped (doubled).
s    = substitute
p    = print
d    = delete
a    = append
i    = insert
\x   = hex sequence
\d   = dec sequence

Base substitution commands:
SED "s/SearchPattern/SubstStr/" "InputFile"      replaces the first occurrence of the search pattern in each line with the substitution string
SED "3s/SearchPattern/SubstStr/" "InputFile"     replaces the first occurrence of the search pattern in line 3 only with the substitution string
SED "3s/SearchPattern/SubstStr/g" "InputFile"    replaces all occurrences of the search pattern in line 3 only with the substitution string

Search pattern related print commands:
SED -n "/SearchPattern/p" "InputFile"            prints any line from the input stream that matches the search pattern
SED -n "/SearchPattern/!p" "InputFile"           prints any line from the input stream that does not match the search pattern

Search pattern related delete commands (without -n, otherwise nothing is printed at all):
SED "/SearchPattern/d" "InputFile"               deletes any line from the input stream that matches the search pattern
SED "/SearchPattern/!d" "InputFile"              deletes any line from the input stream that does not match the search pattern

Search pattern related append command:
SED "/Headline/ a -----" "InputFile"             appends an underlining line after each line of the input stream that contains the char sequence "Headline"

Insert command:
SED "2i TextToInsert" "InputFile"                inserts "TextToInsert" at line 2 [does not overwrite the content of line 2]

A good starting point to learn more about SED with practical examples
is here:http://sed.sourcefor...et/sed1line.txt
or here:http://www.pement.org/sed/sed1line.txt


To solve task 7, SED's insert command may be the preferred choice for injecting user-defined text into a file.

Unfortunately, this is not as easy for files that are encoded as UTF8.

Even after switching to cp 65001 in a batch, SED itself breaks the UTF8 input stream when used.

This is true for both redirecting and piping.

The only way (in my experience) is to put the desired SED commands into a
SED script file (a simple text file) and to invoke it via the -f switch.
As long as the SED script has been created under the cp 65001 environment,
any write/substitute/delete/append or insert command works fine, without file
corruption and without reverting the target file back to ANSI - always.

In the example below, the Unicode file path and double byte chars are added to the UTF8 file.
An existing or non-existing BOM does not matter.

Copy the code below into the empty batch file:

Spoiler
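
As a cross-check for the result of SED's insert command, here is a minimal Python sketch (not the SED script from the attachment). It does what SED "2i TextToInsert" does, but on the raw bytes of a UTF8 file, and it keeps a BOM in front if there is one. The name insert_line and the CRLF default are assumptions of this example.

```python
import codecs

def insert_line(data, line_number, text, newline="\r\n"):
    """Insert text as a new line before line_number (1-based) of UTF8 data."""
    bom = b""
    if data.startswith(codecs.BOM_UTF8):
        # Set the BOM aside so that it stays the very first bytes of the file.
        bom, data = codecs.BOM_UTF8, data[len(codecs.BOM_UTF8):]
    lines = data.decode("utf-8").splitlines(keepends=True)
    lines.insert(line_number - 1, text + newline)
    return bom + "".join(lines).encode("utf-8")

source = codecs.BOM_UTF8 + "This is line one.\r\nThis is line six.\r\n".encode("utf-8")
patched = insert_line(source, 2, "D:\\Test Änvironment欢迎\\")
patched_lines = patched.decode("utf-8-sig").splitlines()
print(len(patched_lines), patched.startswith(codecs.BOM_UTF8))
```

The existing line 2 is not overwritten, it simply moves down by one line, just like with SED's i command.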

 

 

Any feedback (comments, suggestions, criticism) is more than welcome.

Zharif


 

Attached Files

  • Attached File  UTF8.zip   17.09KB   2 downloads


#2 Nuno Brito


Posted 28 May 2020 - 09:15 AM

Good post!

 

:cheers:



#3 Wonko the Sane


Posted 28 May 2020 - 01:03 PM

Very good.  :thumbsup: 

 

Any feedback (comments, suggestions, criticism) is more than welcome.
 

 

Post also all the batches (possibly in a .zip file). We have a looong tradition of the board software (at each §@ç#ing new version) botching posts, particularly those in CODE tags, and/or of this or that browser (and/or target "text" editor) making a mess of copy/paste, particularly when it comes to characters other than 0-9 a-z A-Z, like (only as an example) using different characters for the apostrophe, quote and double quote.

 

:duff:

Wonko



#4 Zharif


Posted 29 May 2020 - 07:03 PM

Thanks,

 

as suggested, template files and batches have been added in a zip archive.

Furthermore, the first post has been updated with some minor modifications.

 

I'm really curious whether anybody will participate or contribute here.

Maybe the content is no more than a side issue for many users.

As for me, I'm really interested in any kind of new insight.





