Borbin the 🐱

Text file encoding with PowerShell

24 April, 2022


Text files store text in a particular encoding. Common symbols fit in one byte and are stored as such in the file, but extended chars and other glyphs need more than one byte. The standard that defines these characters and their encodings is Unicode.


Common Unicode encodings are utf-8 and utf-16.
utf-8 encodes 7-bit chars as is and is one of the most widely used formats because it results in small files, as most text is 7-bit anyway. All other chars are encoded as a sequence of two to four bytes.
utf-16 uses surrogate pairs to encode code points outside the Basic Multilingual Plane, but in most cases it needs 2 bytes per char. In Windows and .NET it is also known as 'Unicode', with the option of big or little endian byte order. The .NET string class uses utf-16 as well. As with the file format, don't assume each char is exactly two bytes.
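
To make the size difference concrete, here is a minimal sketch that measures the sample text used later in this post. The variable names are only for illustration, and the string is built from code points so the encoding of the script file itself does not matter.

```powershell
# Build the sample string from code points.
$cat  = [char]::ConvertFromUtf32(0x1F63A)   # the cat emoji, outside the basic plane
$kana = -join (0x30D1, 0x30EF, 0x30FC, 0x30B7, 0x30A7, 0x30EB | ForEach-Object { [char]$_ })
$text = $cat + 'abc' + $kana

$utf8Bytes  = [System.Text.Encoding]::UTF8.GetByteCount($text)      # 4 + 3 + 6*3 bytes
$utf16Bytes = [System.Text.Encoding]::Unicode.GetByteCount($text)   # 2*2 + 3*2 + 6*2 bytes
$codeUnits  = $text.Length                                          # utf-16 code units, not characters
$isPair     = [char]::IsSurrogatePair($text, 0)                     # the emoji takes two code units

"utf-8 bytes: $utf8Bytes, utf-16 bytes: $utf16Bytes, Length: $codeUnits, surrogate pair at index 0: $isPair"
```

The text has 10 characters, but Length reports 11 because the emoji is stored as a surrogate pair.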


The PowerShell cmdlets Get-Content and Set-Content need the right encoding to read and write a file properly.
Without -Encoding, Windows PowerShell 5.1 decodes a file that has no byte order mark with the system's ANSI code page. Each byte of a multi-byte utf-8 sequence then arrives in the loop variable as a separate, wrong char, which is not very useful. (PowerShell 6 and later default to utf-8 instead.)

# No encoding.
Get-Content $textFile | % { 
    $_
} | Set-Content $textFileOut


If the encoding is missing when the file is read, this original utf-8 text:
😺abcパワーシェル
will be stored like this instead (shown here for the Windows-1252 code page):
ðŸ˜ºabcãƒ‘ãƒ¯ãƒ¼ã‚·ã‚§ãƒ«

# Encoding missing, wrong content in output file.
Get-Content $textFile | % {
    $_
} | Set-Content -Encoding UTF8 $textFileOut
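
The mechanism behind the wrong content can be reproduced in memory. This is only a sketch: iso-8859-1 stands in for the system's ANSI code page (an assumption, chosen because it is available in both Windows PowerShell and PowerShell 7), so the exact garbled chars may differ from what a given machine produces.

```powershell
# Sample text built from code points, so the script file's own encoding does not matter.
$text = [char]::ConvertFromUtf32(0x1F63A) + 'abc'

# The bytes as they are stored in a utf-8 file (without BOM).
$bytes = [System.Text.Encoding]::UTF8.GetBytes($text)   # 4 bytes for the emoji + 3 for abc

# Decoding with a single-byte encoding turns every byte into its own char.
$latin1  = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$garbled = $latin1.GetString($bytes)

# Decoding with the matching encoding restores the text.
$restored = [System.Text.Encoding]::UTF8.GetString($bytes)

"bytes: $($bytes.Length), garbled chars: $($garbled.Length), restored equals original: $($restored -ceq $text)"
```

Writing $garbled back as utf-8 then encodes each of those wrong chars again, which is why the output file cannot be repaired by choosing an encoding on the write side only.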


The encoding is needed to properly read the chars in a text file:

# Read utf-8 file and write as utf-8.
Get-Content -Encoding UTF8 $textFile | % {
    $_
} | Set-Content -Encoding UTF8 $textFileOut


# Read utf-8 file and write as Unicode (utf-16).
Get-Content -Encoding UTF8 $textFile | % {
    $_
} | Set-Content -Encoding Unicode $textFileOut


Note: Get-Content reads a Unicode (utf-16) file correctly even when -Encoding UTF8 is given, presumably because the byte order mark at the start of the file takes precedence, but it does not read a utf-8 file when -Encoding Unicode is given. Do not rely on this.
When the encoding is not known, Get-Content is difficult to use. A better approach is the ReadLines API from .NET, which detects the encoding from the byte order mark and otherwise assumes utf-8:

# Read any file encoding and write as utf-8.
[System.IO.File]::ReadLines($textFile) | % { 
    $_
} | Set-Content -Encoding UTF8 $textFileOut
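
When the encoding of a file is unknown, it can help to look at its first bytes. The following sketch reports which byte order mark, if any, a file starts with. Get-FileBom is a hypothetical helper name, not a built-in cmdlet, and a file without a mark can still be utf-8 or any single-byte encoding.

```powershell
# Hypothetical helper: report which byte order mark (if any) a file starts with.
function Get-FileBom {
    param([Parameter(Mandatory)][string]$Path)

    # Read at most the first four bytes of the file.
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        $buffer = New-Object byte[] 4
        $count  = $stream.Read($buffer, 0, 4)
    }
    finally {
        $stream.Dispose()
    }
    $hex = if ($count -gt 0) { [System.BitConverter]::ToString($buffer, 0, $count) } else { '' }

    # Check the utf-32 marks before utf-16, because FF-FE is a prefix of FF-FE-00-00.
    if     ($hex -eq 'FF-FE-00-00')      { 'utf-32 LE' }
    elseif ($hex -eq '00-00-FE-FF')      { 'utf-32 BE' }
    elseif ($hex.StartsWith('EF-BB-BF')) { 'utf-8' }
    elseif ($hex.StartsWith('FF-FE'))    { 'utf-16 LE' }
    elseif ($hex.StartsWith('FE-FF'))    { 'utf-16 BE' }
    else                                 { 'none' }
}
```

Call it as Get-FileBom -Path $textFile. Pass a full path, because .NET file APIs resolve relative paths against the process working directory, which can differ from the current PowerShell location.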


If the byte order mark (BOM) is not needed, use this to write utf-8 without a BOM:

# Read any file encoding and write as utf-8 without BOM.
[string[]]$contentLines = [System.IO.File]::ReadLines($textFile)
[Text.UTF8Encoding]$encoding = New-Object System.Text.UTF8Encoding $false
[IO.File]::WriteAllLines($textFileOut, $contentLines, $encoding)


The ReadLines API does not load the whole file into memory at once, which allows very large files to be processed line by line. (The [string[]] cast above does collect all lines first.) If you need the file as one string, use this:

# Read text as one string with any file encoding and write as utf-8 without BOM.
[string]$content = [System.IO.File]::ReadAllText($textFile)
[Text.UTF8Encoding]$encoding = New-Object System.Text.UTF8Encoding $false
[IO.File]::WriteAllText($textFileOut, $content, $encoding)
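
For very large files, the line-by-line behavior of ReadLines can be kept on the write side as well. This is a minimal sketch, assuming a StreamWriter is acceptable for the output. Convert-ToUtf8NoBom is a hypothetical helper name.

```powershell
# Hypothetical helper: copy a text file line by line and write it as utf-8 without BOM.
function Convert-ToUtf8NoBom {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$Destination
    )

    $encoding = New-Object System.Text.UTF8Encoding $false
    $writer   = New-Object System.IO.StreamWriter $Destination, $false, $encoding
    try {
        # ReadLines yields one line at a time, so only the current line is held in memory.
        foreach ($line in [System.IO.File]::ReadLines($Path)) {
            $writer.WriteLine($line)
        }
    }
    finally {
        # Flush and close the output file even if reading fails.
        $writer.Dispose()
    }
}
```

Call it as Convert-ToUtf8NoBom -Path $textFile -Destination $textFileOut. Note that WriteLine ends every line with the platform's line break, so the line endings of the source file are normalized.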


XML files are also text files with an encoding. Most XML files use utf-8, but when a file uses a different encoding, this commonly used code no longer works:

# Do not use.
[xml]$xml = Get-Content -Encoding UTF8 $xmlFile


Use this instead:

# Read XML file.
[xml]$xml = New-Object xml
$xml.Load($xmlFile)
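
The difference shows up with a file that is not utf-8. The sketch below writes a small iso-8859-1 document to a temporary file (the file content and variable names are made up for illustration), then reads it once through the XML parser and once as forced utf-8 text.

```powershell
# Write a small XML file encoded as iso-8859-1, with a matching declaration.
$xmlPath = [System.IO.Path]::GetTempFileName()
$e       = [char]0xE9   # e with acute accent, a single byte in iso-8859-1
$source  = '<?xml version="1.0" encoding="iso-8859-1"?><root>caf{0}</root>' -f $e
[System.IO.File]::WriteAllText($xmlPath, $source, [System.Text.Encoding]::GetEncoding('iso-8859-1'))

# Load() lets the XML parser pick the encoding from the declaration.
$doc = New-Object System.Xml.XmlDocument
$doc.Load($xmlPath)
$value = $doc.DocumentElement.InnerText

# Forcing utf-8 on the same bytes replaces the accented char with U+FFFD.
$wrong          = [System.IO.File]::ReadAllText($xmlPath, [System.Text.Encoding]::UTF8)
$hasReplacement = $wrong.Contains([string][char]0xFFFD)

Remove-Item $xmlPath
"root text length: $($value.Length), utf-8 read contains replacement char: $hasReplacement"
```

The parser reads the declaration first and decodes the rest accordingly, which a text read with a fixed encoding cannot do.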


Save uses the encoding named in the XML declaration. For a document declared as utf-8, the output file is utf-8 with a BOM:

# Save xml as utf-8 with signature (BOM).
$xml.Save($xmlFileOut)


To not write a BOM, use this:

# Save xml as utf-8 without BOM.
$encoding = [System.Text.UTF8Encoding]::new($false)
$writer = [System.IO.StreamWriter]::new($xmlFileOut, $false, $encoding)
try {
    $xml.Save($writer)
}
finally {
    # Close the file even if Save throws.
    $writer.Dispose()
}