
Saturday, November 15, 2025

When is AnsiString(String(ansstr)) <> ansstr?


Remember the good old times when strings were composed of 8-bit characters? Those were simpler times. When we needed a data buffer, we just used an AnsiString. An 8-bit character is just a byte, surely?

Then Unicode arrived and suddenly all strings were based on 16-bit characters. The code broke all around us. At the company I'm working for we needed about one year to port all our applications to Unicode. (And by “we” I mean myself. I allocated one day per week to work on Unicode issues and slowly worked through the code. Today I'd probably use AI for the task. :) )

We completed this port in Delphi 2010 times. And then we spent the next 10-ish Delphi versions fixing bugs related to this change. And when I thought we had fixed them all (hope, hope), Microsoft did something unexpected.

Some time ago one of our applications started experiencing occasional data corruption, but only on the newest Windows Server 2025. It took us quite some time to find the cause, namely that the new OS version uses UTF-8 as the system code page by default.

Why is that a problem? Well, as it turned out, our code still wasn't completely converted to Unicode. There was still some data buffer that was declared as an AnsiString. It was sent through various functions, was converted to a Unicode string at one location and back to AnsiString at another. And that was enough to destroy the data.

In short, the code did this:

var ansstr := SomeData();
Process(AnsiString(string(ansstr)));

You may think to yourself — so what? Data got converted to Unicode according to the system locale and then it got converted back. Surely that should not destroy the data?

Let me introduce a simple application that points directly to the source of the problem.

program TestAnsiBuffer;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.SysUtils;

function DumpBuffer(const astr: AnsiString): string;
begin
  Result := Format('[%d] ', [Length(astr)]);
  for var ch in astr do
    Result := Result + Format('%.2x', [Ord(ch)]);
end;

procedure Check(const astr: AnsiString);
begin
  Write('Original: ', DumpBuffer(astr));
  Writeln(', Converted: ', DumpBuffer(AnsiString(string(astr))));
end;

begin
  Check(#$41#$42#$43);
  Check(#$01#$02#$03);
  Writeln('---');
  Check(#$e2#$28#$a1);
  Check(#$e2#$82#$28);
  Check(#$c3#$28);
end.

[The code is also available at https://github.com/gabr42/GpDelphiCode/blob/master/TestAnsiBuffer/TestAnsiBuffer.dpr]

On my development computer this produces:

Original: [3] 414243, Converted: [3] 414243
Original: [3] 010203, Converted: [3] 010203
---
Original: [3] E228A1, Converted: [3] E228A1
Original: [3] E28228, Converted: [3] E28228
Original: [2] C328, Converted: [2] C328

In short, all is well. All converted buffers are the same as the originals.

On a computer with the problematic option enabled, I get this:

Original: [3] 414243, Converted: [3] 414243
Original: [3] 010203, Converted: [3] 010203
---
Original: [3] E228A1, Converted: [7] EFBFBD28EFBFBD
Original: [3] E28228, Converted: [4] EFBFBD28
Original: [2] C328, Converted: [4] EFBFBD28

While the first two buffers are just fine, the latter three are all corrupted.

If you look carefully, you'll notice that in all bad buffers we get the same sequence: EF BF BD. Search for it on the Internet and you'll see that this is the UTF-8 representation of �, also known as the replacement character.

Now it gets clearer. While all five test sequences represent valid binary data, they don't all represent valid UTF-8 encodings. In fact, the latter three were specially constructed so that they represent short invalid UTF-8 sequences. Somewhere in the conversion process (and I didn't bother searching for the specific place) Windows says “oh, this is not a valid UTF-8 sequence; let's replace it with a replacement character.”
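If you want to detect such invalid sequences before a conversion can mangle them, the underlying Win32 API can be asked to fail instead of substituting replacement characters. Here is a small sketch (IsValidForCodePage is a hypothetical helper, not something from our code): MultiByteToWideChar, which ultimately performs the AnsiString-to-string conversion, silently substitutes U+FFFD unless the MB_ERR_INVALID_CHARS flag is passed.

```pascal
uses
  Winapi.Windows;

// Hypothetical helper: returns True if the buffer is a valid byte
// sequence for the given code page. With MB_ERR_INVALID_CHARS,
// MultiByteToWideChar fails on invalid input instead of silently
// inserting U+FFFD replacement characters.
function IsValidForCodePage(const buf: RawByteString; codePage: UINT): Boolean;
begin
  Result := (buf = '') or
    (MultiByteToWideChar(codePage, MB_ERR_INVALID_CHARS,
       PAnsiChar(buf), Length(buf), nil, 0) > 0);
end;
```

With the test data from above, IsValidForCodePage(#$41#$42#$43, CP_UTF8) should return True, while IsValidForCodePage(#$e2#$28#$a1, CP_UTF8) should return False, which is exactly the case where the round-trip inserts EF BF BD.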

Fixing this problem is easy. Just remove the incorrect casting. (And while doing that, change the buffer from AnsiString to something more appropriate.) Finding the problem — well, that was hard.
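A sketch of what "something more appropriate" could look like (SomeData and Process are placeholders from the article; their TBytes-based signatures here are my assumptions): binary data kept in a TBytes is never touched by any code-page conversion, implicit or explicit.

```pascal
uses
  System.SysUtils;

// Placeholder for the real processing routine, now taking raw bytes.
procedure Process(const data: TBytes);
begin
  // ... work on raw bytes ...
end;

var
  ansstr: AnsiString;
  buffer: TBytes;
begin
  ansstr := SomeData();      // legacy API still returns AnsiString
  buffer := BytesOf(ansstr); // copies the raw bytes, no code-page conversion
  Process(buffer);           // no AnsiString(string(...)) round-trip
end.
```

BytesOf with a RawByteString/AnsiString argument copies the bytes as-is, so this is a safe bridge while the legacy AnsiString-producing code is still around.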

If you'd like to test your applications, you can find this setting at Settings > Time & language > Language & region > Administrative language settings > Change system locale. Then check Beta: Use Unicode UTF-8 for worldwide language support.

Or, if you have a relatively new Windows 11 setup, you'll find this option directly on the Language & region page.
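You can also check for this situation at run time. A minimal sketch: GetACP returns the active ANSI code page, and it reports 65001 (CP_UTF8) when this option is enabled.

```pascal
program CheckAcp;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows;

begin
  // GetACP returns the system ANSI code page; CP_UTF8 = 65001 when the
  // "Beta: Use Unicode UTF-8" option is enabled.
  if GetACP = CP_UTF8 then
    Writeln('ANSI code page is UTF-8 - AnsiString round-trips are lossy for binary data!');
end.
```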


Enabling this option on my Windows 11 machine, however, produces a different result than on the server:

Original: [3] 414243, Converted: [3] 414243
Original: [3] 010203, Converted: [3] 010203
---
Original: [3] E228A1, Converted: [5] C3A228C2A1
Original: [3] E28228, Converted: [6] C3A2E2809A28
Original: [2] C328, Converted: [3] C38328

The bytes give a hint at what is going on: this output is exactly what you get if the AnsiString to string conversion still decodes with the legacy Windows-1252 code page (byte E2 becomes â, U+00E2) while the string to AnsiString conversion already encodes to UTF-8 (â becomes C3 A2). Maybe there's a good reason why this setting is still in beta?

13 comments:

  1. Anonymous 21:11

    It is even worse. Unicode is not about 16-bit characters. UCS4 is 32-bit, and some glyphs need surrogates, so two 16-bit WideChars are needed to encode a single UCS4 codepoint...

  2. Anonymous 15:20

    But it is not enabled by default in Windows Server 2025. Are you sure this is a fresh install without changes?

    Replies
    1. Well, it was six months ago.

    2. Anonymous 16:17

      I have a VPS running 2025; it was not enabled six months ago, nor is it now. And the MS docs say that it must be configured manually. I'm not sure why it is enabled on your server. If it is a fresh install on your VPS, maybe the system image your VPS provider supplied is wrong.

    3. The question is actually not why this was enabled on our server, but what your code will do when somebody runs it on a system where this option is enabled.

  3. It is very important to know whether this option (Use Unicode UTF-8) is enabled by default in a Windows Server 2025 clean install, and if yes, since what version or which KB update. I definitely believe Microsoft wouldn't change this silently without any notice, because it would surely break some old programs.

    Replies
    1. I cannot tell what our support team was doing at that time (regarding the installation), but I'm 100% sure they didn't toggle this option "just to see what happens". Could be that MS had it turned on for some time and then backed off - I wouldn't have any way of testing that.

    2. Thanks for the reply, Primoz! I'll try to google this question, maybe there's some info on this case...

  4. Not much info as of current time.

    https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

    Interesting quote from the article above:
    -A vs. -W APIs

    Win32 APIs often support both -A and -W variants.

    -A variants recognize the ANSI code page configured on the system and support char*, while -W variants operate in UTF-16 and support WCHAR.

    Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, then -A APIs typically operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes.

  5. DeepSeek:

    Enabling the **"Beta: Use Unicode UTF-8 for worldwide language support"** option in Windows significantly changes how the system handles character encoding, which can impact Win32 native applications in several ways. Here's a detailed breakdown:

    ---

    ### **1. What This Setting Does**
    - **System Locale Change**: It configures Windows to use **UTF-8 as the default ANSI code page** (CP_ACP) system-wide, replacing legacy locale-specific code pages (e.g., Windows-1252 for English systems).
    - **ANSI API Behavior**: Win32 APIs that use "A" versions (ANSI) will now interpret strings as UTF-8 instead of the legacy 8-bit code page.

    ---

    ### **2. Effects on Win32 Applications**

    #### **A. Well-Behaved Unicode-Aware Applications**
    - **Minimal Impact**: Applications that:
      - Use "W" versions of Win32 APIs (e.g., `CreateFileW`).
      - Explicitly handle UTF-8 in ANSI functions (e.g., `SetConsoleOutputCP(65001)`).
      - Are built with Unicode character sets (common in modern apps).
    - **Benefit**: UTF-8 compatibility improves interoperability with cross-platform tools, file systems, and networks.

    #### **B. Legacy ANSI Applications**
    - **Potential Issues**:
      - **String Corruption**: If an app assumes a fixed 1-byte-per-character encoding (e.g., Windows-1252), UTF-8 multi-byte characters may break parsing/logic.
      - **API Misbehavior**: ANSI APIs like `CreateFileA` might fail with non-ASCII paths if the app doesn’t expect UTF-8.
      - **Buffer Overruns**: Incorrect buffer size calculations (assuming 1 byte = 1 character) can lead to crashes.
    - **Example**: A tool using `strlen` on UTF-8 strings may miscalculate lengths, causing memory issues.

    #### **C. Mixed ANSI/Unicode Usage**
    - **Inconsistencies**: Apps mixing "A" and "W" APIs might see mismatched string interpretations.
    - **Global State Dependencies**: Libraries relying on the system code page may behave unexpectedly.

    ---

    ### **3. Technical Details**
    - **Code Page 65001**: The system uses UTF-8 (CP_UTF8) for:
      - ANSI Win32 APIs (e.g., `MessageBoxA`).
      - C runtime functions like `fopen` (if `_setmbcp(65001)` is used).
    - **File Systems**: UTF-8 paths become accessible to ANSI APIs without conversion layers.
    - **Console**: The Windows console may better handle UTF-8 output (though limitations remain).

    ---

    ### **4. Common Compatibility Problems**
    - **Legacy Software**: Older apps/games using ANSI with hardcoded code pages may exhibit garbled text or crashes.
    - **Third-Party Libraries**: Non-Unicode libraries might malfunction if they assume a single-byte code page.
    - **Language-Specific Issues**: Apps optimized for East Asian languages (e.g., Shift-JIS) often break.

    ---

    ### **5. When to Enable This Setting**
    - **Recommended For**:
      - Developers working with cross-platform code (e.g., Linux/macOS compatibility).
      - Environments heavily using UTF-8 (e.g., web services, modern file systems).
    - **Avoid If**:
      - You rely on legacy business software not tested with UTF-8.
      - You use older multimedia tools with ANSI assumptions.

    ---

    ### **6. Testing and Mitigation**
    - **Test Thoroughly**: Validate app behavior with UTF-8 enabled, especially for file I/O, UI rendering, and network communication.
    - **Update Code**: Migrate to "W" APIs or explicitly handle UTF-8 in ANSI paths.
    - **Rollback**: Disable the setting if critical apps malfunction.

    ---

    ### **Summary**
    Enabling UTF-8 system locale modernizes Windows' encoding handling but risks breaking legacy Win32 apps not designed for UTF-8. While beneficial for future-proofing, test compatibility with your specific applications before deploying system-wide.

  6. This option looks like another "quick patch" from Microsoft, trying to solve complex problems which actually need far more effort to solve correctly (not just changing the behavior of all "A" variants of old WinAPI functions to treat strings as multi-byte UTF-8 instead of 1-byte ANSI).

  7. I mean, this is just a bug in your code; admittedly, you were led there by the classic Delphi anti-pattern of using a text data type to hold binary data.

    Having UTF8 as the ANSI code page is very powerful though. And Delphi AnsiString suddenly becomes relevant again in that setting.

  8. That Beta setting really can mess things up. If I recall correctly, we could not connect to a Firebird database.

    I think I raised the possibility that this setting could cause problems a few years ago. I didn't do any analysis of it, though, so most likely everyone just ignored it. Which is my bad ;)
