Borbin the 🐱

C++ aggregate using a lambda

13 October, 2024


For the EV diff plugin in cPicture, which calculates the exposure difference between pictures, I needed a simple aggregate to print the parameters used.

For example, the list values 1, 2 and 3 should be converted to the string "(1, 2, 3)". An aggregate function maps an enumerable type (vector) to a plain type (string). The function takes the begin and end iterators of the vector, a toString lambda that converts an individual vector element to a string, the start and end text, and the separator.

I wanted to create a function similar to the Join-String function in PowerShell that is both simple and versatile. This is my implementation:

template<class T>
CString JoinString(typename vector<T>::const_iterator begin,
    typename vector<T>::const_iterator end,
    const function<CString(T)>& toString,
    const CString& startText,
    const CString& endText,
    const CString& separator)
{
    CString text(startText);

    for (typename vector<T>::const_iterator it = begin; it != end; ++it)
    {
        // Use the lambda to convert the template type data to text.
        text += toString(*it);

        // Skip separator for last element.
        if (it != end - 1)
        {
            text += separator;
        }
    }

    return text + endText;
}


Test case for JoinString using int data.

The input is a vector of three ints which gets converted to a string that starts with an opening bracket '(', separates the elements with a comma ',' and ends with a closing bracket ')'. The toString lambda 'getText' converts the data to a string.

vector<unsigned int> matchList = { 1, 2, 3 };

// toString lambda expression (%u matches the unsigned int data)
auto getText([](auto id) { CString text; text.Format(L"%u", id); return text; });

const CString matchParameter(JoinString<unsigned int>(matchList.begin(), matchList.end(), getText, L"(", L")", L", "));

Assert::AreEqual(L"(1, 2, 3)", matchParameter);


Test case for JoinString using string data.

The lambda simply passes the data and can be inlined.

vector <CString> strList = { L"a", L"b", L"c" };

const CString combinedText(JoinString<CString>(strList.begin(), strList.end(), [](auto text) { return text; }, L"[", L"]", L"+"));

Assert::AreEqual(L"[a+b+c]", combinedText);
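
For readers without MFC/ATL, the same idea can be sketched with std::string. This is only an illustration, not the cPicture implementation: the name JoinRange and the iterator and lambda template parameters are assumptions made here, so that any iterator range can be used and the element type does not have to be spelled out at the call site.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Portable sketch of the join idea (hypothetical name 'JoinRange'):
// std::string replaces CString and any forward iterator range is accepted.
template <class Iterator, class ToString>
std::string JoinRange(Iterator begin, Iterator end, ToString toString,
                      const std::string& startText,
                      const std::string& endText,
                      const std::string& separator)
{
    std::string text(startText);

    for (Iterator it = begin; it != end; ++it)
    {
        // Put the separator in front of every element except the first one.
        if (it != begin)
        {
            text += separator;
        }

        // Use the lambda to convert the element to text.
        text += toString(*it);
    }

    return text + endText;
}

// Usage example with the values of the int test case.
void JoinRangeExample()
{
    const std::vector<unsigned int> matchList = { 1, 2, 3 };
    const std::string joined = JoinRange(matchList.begin(), matchList.end(),
        [](unsigned int id) { return std::to_string(id); }, "(", ")", ", ");
    assert(joined == "(1, 2, 3)");
}
```

Testing for the first element instead of the last one avoids the 'end - 1' arithmetic, so the sketch also works with iterators that are not random access, such as those of std::list.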



Translate to any language using PowerShell

01 June, 2024


Google's online translation service is an easy way to translate text into any language.
For example:

[string]$text = "The picture."
online-translate "de" $text

The returned translation is "Das Bild."


Or:

[string]$text = "The picture search is finished. %1!d! pictures have been scanned and %2!d! duplicate pictures were found.\nWould you like to see the list with the duplicate picture names?"
online-translate "de" $text

The returned translation is "Die Bildsuche ist beendet. %1!d! Bilder wurden gescannt und %2!d! Es wurden doppelte Bilder gefunden. Möchten Sie die Liste mit den doppelten Bildnamen sehen?"

Note that the placeholders are returned unmodified (make sure to use the REST API setting 'client=gtx').


This is all done with a PowerShell script using the public Google REST API:

<#
.DESCRIPTION
    Translate any text into a language using the Google online translation service.
#>
function online-translate([string]$language, [string]$text) {

    # Escape newlines.
    $text = $text.Replace("`n", '\n')

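    # The google rest API. The text is URL-encoded so that chars like '&' or '#' do not break the query.
    [string]$uri = "https://translate.googleapis.com/translate_a/single?client=gtx&tl=$language&q=$([uri]::EscapeDataString($text))&sl=auto&dt=t"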
    [string]$response = (Invoke-WebRequest -Uri $uri -Method Get).Content

    # Combine the segments of the response to a single string.
    # Regex is rather simple: Use the start pattern '[[["', or the segment pattern ']]]],["'
    # to capture the sentences in the text group.
    $m = ([regex]'(?:(?:^\[\[\[)|(?:\]\]\]\],\[))"(?<text>.*?)",".*?",null').Matches($response)
    [string]$translation = ($m | % { $_.groups['text'].Value }) -join ""
    
    # Adjust the translated text.
    $translation.Replace('\"', '"').Replace('\\n', "`n").Replace('\\r', "`r").Replace('[TAB]', "`t").Replace('\u003c', '<').Replace('\u003e', ">").Replace('\u003d', "=").Replace('\\\\', "\\")
}


The REST API response is more complex than the call itself, but with a simple regex this problem is easily solved in the function.
The starting pattern '[[["' or the segment pattern ']]]],["' is used to capture the sentences in the text group.
The number of segments depends on the text input. For example:

A short text returns a single-segment response:

[[["Das Bild.","Das Bild.",null,null,5]],null,"de",null,null,null,1,[],[["de"],null,[1],["de"]]]


A longer text returns a multi-segment response:

[[["Die Bildsuche ist beendet. ","The picture search is finished.",null,null,3,null,null,[[]],[[["84d48e73ebfa38d4d681515b81e0b72a","en_de_2023q1.md"]]]],["%1!d! ","%1!d!",null,null,3,null,null,[[]],[[["84d48e73ebfa38d4d681515b81e0b72a","en_de_2023q1.md"]]]],["Bilder wurden gescannt und %2!d! ","pictures have been scanned and %2!d!",null,null,3,null,null,[[]],[[["84d48e73ebfa38d4d681515b81e0b72a","en_de_2023q1.md"]]]],["Es wurden doppelte Bilder gefunden.\\nMöchten Sie die Liste mit den doppelten Bildnamen sehen?","duplicate pictures were found.\\nWould you like to see the list with the duplicate picture names?",null,null,3,null,null,[[]],[[["84d48e73ebfa38d4d681515b81e0b72a","en_de_2023q1.md"]]]]],null,"en",null,null,null,1,[],[["en"],null,[1],["en"]]]
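
The segment matching can also be tried outside of PowerShell. The following C++ sketch uses the same pattern with std::regex. It is an illustration only: the function name CombineSegments is an assumption, and because std::regex has no named groups, the numbered group 1 takes the role of the 'text' group. The post-processing of escaped chars is left out.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Collect the translated segments of a response and join them
// (hypothetical helper, mirrors the regex of the PowerShell function).
std::string CombineSegments(const std::string& response)
{
    // Start pattern '[[["' or segment pattern ']]]],["', then the translated
    // text in group 1, followed by the source text and 'null'.
    static const std::regex segmentRegex(
        R"re((?:(?:^\[\[\[)|(?:\]\]\]\],\[))"(.*?)",".*?",null)re");

    std::string translation;
    for (std::sregex_iterator it(response.begin(), response.end(), segmentRegex), end;
         it != end; ++it)
    {
        translation += (*it)[1].str();
    }

    return translation;
}

// Usage example with the single-segment response shown above.
void CombineSegmentsExample()
{
    const std::string response =
        R"([[["Das Bild.","Das Bild.",null,null,5]],null,"de",null,null,null,1,[],[["de"],null,[1],["de"]]])";
    assert(CombineSegments(response) == "Das Bild.");
}
```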

chatGPT result encoding

14 February, 2023


chatGPT returns the result as a UTF-8 byte sequence in text form. Anything beyond 7-bit ASCII chars, such as extended chars or languages with other scripts, results in unreadable text.


For example a result returned for the Spanish language:

Â¿QuÃ© habitaciones tienen disponibles?

Expected result:

¿Qué habitaciones tienes disponibles?


Result returned for the Japanese language:

どの部屋が利用可能ですか?  

Expected result:

どの部屋が利用可能ですか? 


You need to encode the result as iso-8859-1 bytes and decode those bytes as UTF-8.
For example, 'é' is encoded in UTF-8 as the byte sequence 0xc3 0xa9. Read as single chars, these bytes are 0xc3: 'Ã' and 0xa9: '©'.
So instead of 'é', chatGPT sends 'Ã©', which is the raw UTF-8 byte sequence.
The string 'Ã©' is a char sequence that represents the byte sequence 0xc3 0xa9. To get the correct Unicode string, the string elements need to be mapped to byte elements.

[byte[]]$byteContent = [System.Text.Encoding]::GetEncoding("iso-8859-1").GetBytes($resultText)

This is done with the iso-8859-1 encoding. It converts each char into an 8-bit representation, which can then be correctly decoded as UTF-8 to a Unicode string:


# Run chatGPT query.
$result = (Invoke-RestMethod @RestMethodParameter)

[string]$resultText = $result.choices[0].text
[byte[]]$byteContent = [System.Text.Encoding]::GetEncoding("iso-8859-1").GetBytes($resultText)

# Get the encoded result.
[string]$text = [System.Text.Encoding]::UTF8.GetString($byteContent)
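
What the iso-8859-1 step does can be shown at byte level. The following C++ sketch is only an illustration with a hypothetical function name: it assumes the garbled text is itself stored as UTF-8 and that every char is in the range U+0000..U+00FF. Each char is mapped back to one byte, and the resulting byte sequence is the intended UTF-8 text.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Map each char of a UTF-8 encoded text back to one byte, similar to what
// GetEncoding("iso-8859-1").GetBytes() does for a .NET string.
// Assumption: every char is in the range U+0000..U+00FF.
std::string Latin1CharsToBytes(const std::string& utf8Text)
{
    std::string bytes;

    for (std::size_t i = 0; i < utf8Text.size(); ++i)
    {
        const unsigned char lead = static_cast<unsigned char>(utf8Text[i]);

        if (lead < 0x80)
        {
            // 7-bit ASCII is stored as a single byte.
            bytes += static_cast<char>(lead);
        }
        else if ((lead == 0xC2 || lead == 0xC3) && i + 1 < utf8Text.size())
        {
            // Two-byte sequence for U+0080..U+00FF: 110000xx 10xxxxxx.
            const unsigned char trail = static_cast<unsigned char>(utf8Text[++i]);
            if ((trail & 0xC0) != 0x80)
            {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }

            bytes += static_cast<char>(((lead & 0x03) << 6) | (trail & 0x3F));
        }
        else
        {
            throw std::invalid_argument("char outside the iso-8859-1 range");
        }
    }

    return bytes;
}

// Usage example: the two chars 'Ã©' become the bytes 0xc3 0xa9, which is 'é' in UTF-8.
void Latin1CharsToBytesExample()
{
    assert(Latin1CharsToBytes("\xC3\x83\xC2\xA9") == "\xC3\xA9");
}
```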


Here is a full example on how to use chatGPT in PowerShell:


# https://platform.openai.com/account/api-keys
$apikey = "sk-...."

<#
– Model [Required]
ChatGPT offers multiple models. Each model has its own features, strengths, and use cases. You need to select one model when building the request. The models are:

text-davinci-003    Most capable GPT-3 model. It can do any task the other models can do, often with higher quality, longer output, and better instruction-following. It also supports inserting completions within the text.
text-curie-001      Very capable, but faster and lower cost than Davinci.
text-babbage-001    Capable of straightforward tasks, very fast, and lower cost.
text-ada-001        Capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost
#>

$requestBody = @{
    prompt      = "What is the capital of Germany?"
    model       = "text-ada-001"
    temperature = 1
    stop        = "."
} | ConvertTo-Json

$header = @{ 
    Authorization = "Bearer $apikey"
}

$restMethodParameter = @{
    Method      = 'Post'
    Uri         = 'https://api.openai.com/v1/completions'
    body        = $requestBody
    Headers     = $header
    ContentType = 'application/json'
}

# Run chatGPT query.
$result = (Invoke-RestMethod @restMethodParameter)

[string]$resultText = $result.choices[0].text
[byte[]]$byteContent = [System.Text.Encoding]::GetEncoding("iso-8859-1").GetBytes($resultText)

# Get the encoded result.
[string]$text = [System.Text.Encoding]::UTF8.GetString($byteContent)

Scan text with regex in PowerShell

24 April, 2022


The named group capture (?<name>exp) in a regex is an easy way to scan content. This example gets the text enclosed in quotes in a string. This is how it is done in PowerShell:

# Get the text enclosed in quotes.
[string]$text = 'This is an "example text".'
[string]$textRegex = '\"(?<Text>.*?)\"'

if ($text -match $textRegex) {
    $matches['Text']
}

This outputs
example text


Or split a formatted string into parts. For example the assignment structure 'id=value':

# Parse the id and value of the text.
[string]$text = '  id123 = abc  '
[string]$idValueRegex = "^\s*(?<id>\w+?)\s*=\s*`"?(?<value>.+?)`"?\s*$"

if ($text -match $idValueRegex) {
    "id=$($matches['id']), value=$($matches['value'])"
}

This outputs
id=id123, value=abc
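
For comparison, here is a sketch of the same 'id=value' parsing in C++. It is an illustration only: the function name ParseAssignment is an assumption, and because std::regex does not support named groups, the numbered groups 1 and 2 are used for id and value.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Split an assignment like '  id123 = abc  ' into id and value.
// Returns false if the text does not have the 'id=value' structure.
bool ParseAssignment(const std::string& text, std::string& id, std::string& value)
{
    // Group 1 is the id, group 2 the value (optionally enclosed in quotes).
    static const std::regex idValueRegex(R"re(^\s*(\w+?)\s*=\s*"?(.+?)"?\s*$)re");

    std::smatch match;
    if (!std::regex_match(text, match, idValueRegex))
    {
        return false;
    }

    id = match[1].str();
    value = match[2].str();
    return true;
}

// Usage example with the text from the PowerShell sample.
void ParseAssignmentExample()
{
    std::string id, value;
    assert(ParseAssignment("  id123 = abc  ", id, value));
    assert(id == "id123" && value == "abc");
}
```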


Or parse a repeated pattern, for example the content of each pair of curly brackets in " abc { 123 } { def } 456 {xyz}":

[string]$text = " abc { 123 } { def } 456 {xyz}"
[string]$bracketRegex = "[{]\s*(?<Text>.*?)\s*[}]"

([regex]$bracketRegex).Matches($text) | % {
    [System.Text.RegularExpressions.Match]$match = $_
    [string]$value = $match.Groups["Text"].Value

    $value
}

This outputs
123
def
xyz


Using List in PowerShell

24 April, 2022


PowerShell has good support for arrays and lists, but growing an array with dynamic data re-creates the array on each change, which is inefficient for large lists.
The simplest solution is to use the .NET List class:

    [System.Collections.Generic.List[string]]$content = [System.Collections.Generic.List[string]]::new()

    $content.Add("line1")
    $content.Add("line2")

Text file encoding with PowerShell

24 April, 2022


Text files contain text with a certain encoding. The usual symbols can be represented with one byte and are encoded as such in the file. But extended chars or other glyphs need more than one byte for their representation. The standard for this is Unicode.


Common Unicode encodings are utf-8 and utf-16.
utf-8 encodes 7-bit chars as they are and is one of the most used formats because it results in small file sizes, as most text is 7-bit anyway. All non-7-bit chars are encoded as a sequence of two to four bytes.
utf-16 uses two bytes per char in most cases and surrogate pairs (two 16-bit units) to encode code points outside the basic plane. It is also known as 'Unicode', with the option of big or little endian byte order. The .NET string class also uses utf-16 encoding. As with the file format, don't assume each char is two bytes.
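
To make the difference concrete, here is a small C++ sketch that encodes a single code point in both formats. The function names are assumptions made for this illustration, and the input is expected to be a valid code point (U+0000..U+10FFFF, no surrogates).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encode one code point as utf-8 bytes (1 to 4 bytes).
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        // 7-bit chars are stored as they are.
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as utf-16 units (1 unit, or a surrogate pair).
std::vector<std::uint16_t> EncodeUtf16(std::uint32_t cp)
{
    if (cp < 0x10000)
    {
        return { static_cast<std::uint16_t>(cp) };
    }

    // Code points outside the basic plane are split into a high and a low surrogate.
    const std::uint32_t v = cp - 0x10000;
    return { static_cast<std::uint16_t>(0xD800 | (v >> 10)),
             static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)) };
}

// Usage example: the cat emoji U+1F63A needs 4 bytes in utf-8 and 2 units in utf-16.
void EncodeExample()
{
    assert(EncodeUtf8(0x1F63A).size() == 4);   // F0 9F 98 BA
    assert(EncodeUtf16(0x1F63A).size() == 2);  // D83D DE3A
}
```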


The PowerShell cmdlets Get-Content and Set-Content need an encoding to properly read and write the file.
Without an explicit encoding, the file content is not decoded as utf-8: every byte of a multi-byte sequence ends up as a separate char in the lines passed through the pipeline, which is not very useful.

# No encoding.
Get-Content $textFile | % { 
    $_
} | Set-Content $textFileOut


If the encoding is missing when the file is read, the original text content in utf-8:
😺abcパワーシェル
will be stored as this instead:
😺abcパワーシェル

# Encoding missing, wrong content in output file.
Get-Content $textFile | % {
    $_
} | Set-Content -Encoding UTF8 $textFileOut


The encoding is needed to properly read the chars in a text file:

# Read utf-8 file and write as utf-8.
Get-Content -Encoding UTF8 $textFile | % {
    $_
} | Set-Content -Encoding UTF8 $textFileOut


# Read utf-8 file and write as Unicode (utf-16).
Get-Content -Encoding UTF8 $textFile | % {
    $_
} | Set-Content -Encoding Unicode $textFileOut


Note: Get-Content will read a unicode file even when the utf-8 encoding is used, but it won't read a utf-8 file when the unicode encoding is used. Do not rely on this.
When the encoding is not known, it is difficult to use Get-Content. A more robust approach is to use the ReadLines API from .NET, which detects Unicode encodings from the BOM and otherwise reads the file as utf-8:

# Read any file encoding and write as utf-8.
[System.IO.File]::ReadLines($textFile) | % { 
    $_
} | Set-Content -Encoding UTF8 $textFileOut


Whether Set-Content -Encoding UTF8 writes a BOM depends on the PowerShell version: Windows PowerShell 5.1 writes one, PowerShell 6 and later do not.
Use Text.UTF8Encoding to control whether the BOM is written.
If the Byte Order Mark (BOM) is not needed, use this to write out as utf-8 without BOM:

# Read any file encoding and write as utf-8 without BOM.
[string[]]$contentLines = [System.IO.File]::ReadLines($textFile)
[Text.UTF8Encoding]$encoding = New-Object System.Text.UTF8Encoding $false
[IO.File]::WriteAllLines($textFileOut, $contentLines, $encoding)

If the Byte Order Mark (BOM) is needed, set the first constructor arg of the encoding to $true:


[Text.UTF8Encoding]$encoding = New-Object System.Text.UTF8Encoding $true
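
At byte level, the utf-8 BOM is just the three bytes EF BB BF at the start of the file. The following C++ sketch illustrates what the constructor arg switches on or off. The helper names are hypothetical and the functions work on the file content as a byte string, not on the file itself.

```cpp
#include <cassert>
#include <string>

// The utf-8 signature (BOM) is the byte sequence EF BB BF at the start of a file.
static const std::string Utf8Bom = "\xEF\xBB\xBF";

// Check whether the file content starts with the utf-8 BOM.
bool HasUtf8Bom(const std::string& fileBytes)
{
    return fileBytes.compare(0, Utf8Bom.size(), Utf8Bom) == 0;
}

// Return the content with or without BOM; an existing BOM is never duplicated.
std::string ApplyUtf8Bom(const std::string& fileBytes, bool emitBom)
{
    const std::string content =
        HasUtf8Bom(fileBytes) ? fileBytes.substr(Utf8Bom.size()) : fileBytes;
    return emitBom ? Utf8Bom + content : content;
}

// Usage example: adding the BOM makes the content three bytes longer.
void ApplyUtf8BomExample()
{
    const std::string withBom = ApplyUtf8Bom("abc", true);
    assert(withBom.size() == 6);
    assert(HasUtf8Bom(withBom));
}
```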


The ReadLines API does not load all content into memory at once and allows very large files to be processed line by line. If you need the file in one string, use this:

# Read text as one string with any file encoding and write as utf-8 without BOM.
[string]$content = [System.IO.File]::ReadAllText($textFile)
[Text.UTF8Encoding]$encoding = New-Object System.Text.UTF8Encoding $false
[IO.File]::WriteAllText($textFileOut, $content, $encoding)


XML files are also text files using an encoding. Most XML files use utf-8, but if the encoding is different, this commonly used code no longer works:

# Do not use.
[xml]$xml = Get-Content -Encoding UTF8 $xmlFile


Use this instead:

# Read XML file.
[xml]$xml = New-Object xml
$xml.Load($xmlFile)


The default output file encoding is utf-8 with a BOM:

# Save xml as utf-8 with signature (BOM).
$xml.Save($xmlFileOut)


To not write a BOM, use this:

# Save xml as utf-8 without BOM.
$encoding = [System.Text.UTF8Encoding]::new($false)
$writer = [System.IO.StreamWriter]::new($xmlFileOut, $false, $encoding)
$xml.Save($writer)
$writer.Dispose()
