Unicode and the String Length Lie: Why "hello".length Is Wrong

Open your browser console right now and type: "👨‍👩‍👧‍👦".length. JavaScript will tell you 11. There are not 11 characters in that string. There is one: a family emoji made of four people. This is not a quirk. It is a fundamental misunderstanding baked into how most programming languages handle strings — and it causes real production bugs.

1. Before We Begin: What Even Is a Character?

This sounds like a philosophy seminar question, but it is not. It is the root of every Unicode bug you will ever encounter. The answer depends on which of three different things you mean:

A Unicode code point: a number in the range U+0000 to U+10FFFF. About 150,000 are assigned to characters so far. The letter "A" is U+0041. The emoji 🎉 is U+1F389. The Devanagari character "क" is U+0915.
A code unit: the fixed-size chunk an encoding uses to store code points. UTF-8 uses 8-bit code units, 1 to 4 per code point. UTF-16 uses 16-bit code units: one for code points up to U+FFFF, two (4 bytes in total) for everything above.
A grapheme cluster: what a human sees as one character. This can be made of multiple code points: a base character + combining accent marks, or multiple emoji joined by a Zero Width Joiner (ZWJ).
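
To make the three levels concrete, here is a minimal sketch that measures one string at each of them. The function name measure is made up for this illustration; TextEncoder and Intl.Segmenter are standard globals in current browsers and Node.js.

```javascript
// Count one string at each level: encoded size, code points, grapheme clusters.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

function measure(str) {
  return {
    utf8Bytes: new TextEncoder().encode(str).length,       // 8-bit code units
    utf16Units: str.length,                                // 16-bit code units
    codePoints: [...str].length,                           // string iteration walks code points
    graphemes: [...graphemeSegmenter.segment(str)].length, // user-perceived characters
  };
}

console.log(measure("A"));         // 1 byte, 1 unit, 1 code point, 1 grapheme
console.log(measure("\u{1F389}")); // 🎉: 4 bytes, 2 units, 1 code point, 1 grapheme
console.log(measure("e\u0301"));   // e + combining acute: 3 bytes, 2 units, 2 code points, 1 grapheme
```

The last example is the interesting one: every level gives a different answer, and only the grapheme count matches what a reader sees.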

The family emoji: a crash course in how bad it gets

👨‍👩‍👧‍👦 is made of 7 code points: Man (U+1F468) + ZWJ (U+200D) + Woman (U+1F469) + ZWJ + Girl (U+1F467) + ZWJ + Boy (U+1F466). Rendered on screen: one visible character. In JavaScript's .length: 11 (each emoji above U+FFFF uses 2 UTF-16 code units, plus 3 ZWJs at 1 unit each = 8+3 = 11).
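
You can check that breakdown yourself. The sketch below builds the family emoji from escapes (so no invisible character gets lost in copy and paste) and lists its code points; describeCodePoints is a hypothetical helper name.

```javascript
// The family emoji, spelled out: Man, ZWJ, Woman, ZWJ, Girl, ZWJ, Boy.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Format every code point in a string as U+XXXX.
function describeCodePoints(str) {
  return [...str].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(describeCodePoints(family));
// ["U+1F468", "U+200D", "U+1F469", "U+200D", "U+1F467", "U+200D", "U+1F466"]
console.log(family.length); // 11 UTF-16 code units
```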

2. How JavaScript Counts (and Why It's Not Actually Wrong)

JavaScript strings are sequences of UTF-16 code units, and the .length property returns the number of code units. This was a perfectly reasonable design choice in 1995, when Unicode was still young and every character it defined fit in the Basic Multilingual Plane (BMP, U+0000–U+FFFF): the Latin scripts, Cyrillic, the Indic scripts, and the CJK ideographs encoded at the time. One code unit = one code point. Easy.

Then Unicode outgrew 16 bits. Most emoji, along with rarer CJK ideographs and many historic scripts, live above U+FFFF (the "supplementary planes"). Each of those code points takes 2 UTF-16 code units, called a surrogate pair. JavaScript engines render them correctly, but .length still counts raw code units.
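
The surrogate pair arithmetic is simple enough to write by hand. This is a sketch of the standard UTF-16 encoding rule; toSurrogatePair is a name invented for the example.

```javascript
// Split a supplementary-plane code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("only code points above U+FFFF need a surrogate pair");
  }
  const offset = codePoint - 0x10000;       // a 20-bit value
  const high = 0xd800 + (offset >> 10);     // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);    // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f389); // 🎉
console.log(high.toString(16), low.toString(16)); // d83c df89
console.log(String.fromCharCode(high, low) === "\u{1F389}"); // true
```

Those two 16-bit values are exactly what .length counts when it reports 2 for a single emoji.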

// Regular letters — length matches what you'd expect
"hello".length         // 5 ✓
"नमस्ते".length        // 6 ← not 3 syllables, but 6 code points
"café".length          // 4 or 5 depending on normalisation

// Emoji: each code point above the BMP costs 2 code units
"🎉".length            // 2 ✗ (one emoji, two code units)
"🏳️‍🌈".length          // 6 ✗ (white flag + variation selector + ZWJ + rainbow, rendered as one)
"👨‍👩‍👧‍👦".length        // 11 ✗ (family, rendered as one)

// Iterating is safer — spreads by code point
[..."🎉"].length        // 1 ✓
[..."👨‍👩‍👧‍👦"].length    // 7 ✗ (correct code points, but still not "1 character")

3. The Indian Languages Problem Is Different

Emoji are visible and annoying when they break. Indic script bugs are invisible and catastrophic. If your app takes user input in Hindi, Bengali, Tamil, or Telugu, you need to understand how these scripts encode in Unicode.

Devanagari (Hindi) combines base consonants, vowel signs (matras), and the halant (virama) to form visually distinct syllables. "नमस्ते" reads as three units (na-ma-ste) but has 6 Unicode code points:

न  = U+0928  (NA)
म  = U+092E  (MA)
स  = U+0938  (SA)
्  = U+094D  (VIRAMA — the halant joining SA + TA)
त  = U+0924  (TA)
े  = U+0947  (vowel sign E)

"नमस्ते".length   // 6 (correct code points)
[..."नमस्ते"]     // ['न','म','स','्','त','े'] — 6 elements

Now imagine your UI shows a character counter that says "6 characters" when the user typed 3 syllables. Or your backend validates "max 10 characters" and rejects a 5-syllable Hindi name because it's "11 characters". Both happen constantly in production code written by developers who never tested with non-Latin input.

Username validation trap

If your backend does if (username.length > 20) reject() and your frontend shows a character counter that counts differently, they will disagree when the username contains Devanagari, Tamil, or emoji. The user sees "12 characters". Your backend counts 24 code units and rejects it. Nobody knows why. This is a real inclusivity issue, not just a technical one.
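
One way out of this trap is to run the same counting function on both sides. The sketch below assumes a limit of 20 user-perceived characters; MAX_USERNAME_GRAPHEMES, countGraphemes, and validateUsername are hypothetical names, not part of any library.

```javascript
// Hypothetical limit, expressed in user-perceived characters.
const MAX_USERNAME_GRAPHEMES = 20;

const usernameSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Count grapheme clusters without building an intermediate array.
function countGraphemes(str) {
  let count = 0;
  for (const _segment of usernameSegmenter.segment(str)) count += 1;
  return count;
}

// The same function can back the frontend counter and the backend check.
function validateUsername(raw) {
  const username = raw.normalize("NFC");
  const length = countGraphemes(username);
  return { ok: length >= 1 && length <= MAX_USERNAME_GRAPHEMES, length, username };
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(validateUsername(family.repeat(20)).ok); // true: 20 visible characters
console.log(family.repeat(20).length > 20);          // true: the naive check would reject it
```

Grapheme rules change between Unicode versions, so a browser and a server with different Unicode data can still disagree on edge cases. Treat the server's count as authoritative and the frontend counter as a hint.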

4. The Normalisation Ambush

Here's a fun one. Paste this into your browser console or a Node.js REPL:

// "café": two different code point sequences, identical appearance
const a = "caf\u00E9";    // é as U+00E9 (precomposed)
const b = "cafe\u0301";   // e + combining acute accent (U+0301)

a === b        // false
a.length       // 4
b.length       // 5
a.normalize() === b.normalize()  // true ✓ (normalize() defaults to NFC)

This is Unicode normalisation. Many accented characters have two representations: precomposed (a single code point, the form NFC produces) or decomposed (base + combining mark, the form NFD produces). Both look identical. They are not equal until you normalise them to the same form.

Where this bites you: password hashing. A user sets their password on one device and later tries to log in from another whose keyboard or operating system emits a different normalisation form. Same password, different bytes, different hash, login fails. It is easy to miss because the two strings look identical in every log and every UI.

❌ Comparing raw strings

function login(input, stored) {
  return hash(input) === stored;
  // Fails if normalisation differs
}

✅ Normalise before hashing

function login(input, stored) {
  return hash(input.normalize('NFC'))
         === stored;
  // Consistent regardless of source, as long as
  // stored was also hashed from the NFC form
}
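
The fix only works if registration and login apply the same normalisation. Here is a minimal sketch of that pairing. makeAuth and toyHash are invented for the illustration, and toyHash is a deterministic stand-in, not a password hash: a real system would use a salted password-hashing function and its verify routine, with the same normalisation step in front.

```javascript
// Build a tiny in-memory auth store around any deterministic hash function.
function makeAuth(hashFn) {
  const store = new Map();
  const canonical = (password) => password.normalize("NFC");
  return {
    register(user, password) {
      store.set(user, hashFn(canonical(password)));
    },
    login(user, password) {
      return store.has(user) && store.get(user) === hashFn(canonical(password));
    },
  };
}

// Stand-in "hash" for demonstration only: the code points as hex.
const toyHash = (s) => [...s].map((ch) => ch.codePointAt(0).toString(16)).join("-");

const auth = makeAuth(toyHash);
auth.register("asha", "caf\u00E9");              // precomposed é
console.log(auth.login("asha", "cafe\u0301"));   // true: decomposed é still matches
console.log(toyHash("caf\u00E9") === toyHash("cafe\u0301")); // false without normalisation
```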

5. The Right Tool: Intl.Segmenter

Intl.Segmenter, standardised in the 2022 edition of ECMA-402 (the ECMAScript Internationalization API), is the correct way to split strings into what humans perceive as characters. It understands grapheme clusters, not just code points.

const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });

function realLength(str) {
  return [...seg.segment(str)].length;
}

realLength("hello")          // 5 ✓
realLength("🎉")             // 1 ✓
realLength("👨‍👩‍👧‍👦")         // 1 ✓
realLength("नमस्ते")          // 3 on engines with Unicode 15.1+ segmentation rules, 4 on older ones
realLength("🏳️‍🌈")           // 1 ✓

// Character counter for a UI
function charCounter(input, max) {
  const len = realLength(input);
  return `${len}/${max}${len > max ? ' ⚠️' : ''}`;
}

charCounter("Hello 👋", 10)  // "7/10"

Browser support

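Intl.Segmenter is supported in Chrome 87+, Safari 14.1+, Firefox 125+, and Node.js 16+. Firefox was the last major engine to ship it, in 2024, so in 2026 it is available nearly everywhere your users are. Use it. If you must support an older runtime, the graphemer npm package is a standalone grapheme splitter (a separate library with its own API, not a drop-in polyfill).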
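
If you cannot rule out older runtimes, feature detection keeps the counter from crashing. This sketch falls back to counting code points, which is an assumption made for the example: the fallback overcounts ZWJ sequences and combining marks, so it is a degraded mode rather than an equivalent. makeGraphemeCounter is a hypothetical name.

```javascript
// Return a counting function, preferring Intl.Segmenter when it exists.
function makeGraphemeCounter(locale) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
    return (str) => {
      let count = 0;
      for (const _segment of segmenter.segment(str)) count += 1;
      return count;
    };
  }
  // Degraded fallback: code points, not grapheme clusters.
  return (str) => [...str].length;
}

const count = makeGraphemeCounter("en");
console.log(count("hello")); // 5 on either path
```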

6. Python and Java Are Not Safe Either

Python 3 uses Unicode code points as its native string type. len("🎉") returns 1 (correct!). But len("👨‍👩‍👧‍👦") returns 7 (wrong — that is 7 code points, not 1 grapheme cluster). Python is better than JavaScript but still not grapheme-aware by default.

s = "👨‍👩‍👧‍👦"
len(s)      # 7 (code points) — Python doesn't lie as badly

# For true grapheme counting in Python:
import regex  # third-party, not re
graphemes = regex.findall(r'\X', s)
len(graphemes)   # 1 ✓

# Also works for Devanagari
len("नमस्ते")          # 6 (code points)
len(regex.findall(r'\X', "नमस्ते"))  # 3 if your regex release applies the Unicode 15.1 conjunct rule, otherwise 4

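Java strings expose UTF-16 code units, just like JavaScript. "🎉".length() returns 2. Use str.codePointCount(0, str.length()) for code points. For grapheme clusters, use java.text.BreakIterator.getCharacterInstance() from the JDK (how closely it follows the current Unicode rules depends on the JDK version) or the ICU4J library.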

7. Database Storage: The VARCHAR Trap

If you are using MySQL with the utf8 charset (an alias for utf8mb3, not utf8mb4), your database cannot store emoji at all. MySQL's utf8 is 3-byte UTF-8, which covers only the BMP. Emoji are 4-byte. Depending on sql_mode, the insert either fails with an "Incorrect string value" error (strict mode) or stores a truncated string with only a warning.

-- MySQL: always use utf8mb4, not utf8
CREATE TABLE users (
  name VARCHAR(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);

-- PostgreSQL: uses proper UTF-8 by default, no gotcha here
-- VARCHAR(50) counts characters (code points), not bytes in PG

Note that PostgreSQL's VARCHAR(50) counts code points, not bytes and not grapheme clusters. A 50 code point limit therefore fits at most 7 family emoji (7 code points each, 49 in total), which render as only 7 visible characters. You may want to enforce limits on the application side using Intl.Segmenter and let the database handle just the storage.
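
Before a string reaches the database, the application can work out how each layer will see it. The sketch below is one way to do that; storageProfile and fitsCodePointLimit are hypothetical helpers, and the limit of 50 simply mirrors the VARCHAR(50) example.

```javascript
const storageSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Report the sizes that matter to different storage layers.
function storageProfile(str) {
  return {
    utf8Bytes: new TextEncoder().encode(str).length,   // what a byte-based limit sees
    codePoints: [...str].length,                       // what a code point limit sees
    graphemes: [...storageSegmenter.segment(str)].length, // what the user sees
    needsUtf8mb4: /[\u{10000}-\u{10FFFF}]/u.test(str), // true if any code point is above the BMP
  };
}

// Mirror a code-point-based column limit such as VARCHAR(50).
function fitsCodePointLimit(str, limit) {
  return [...str].length <= limit;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(storageProfile(family));                    // 25 bytes, 7 code points, 1 grapheme, needsUtf8mb4: true
console.log(fitsCodePointLimit(family.repeat(7), 50));  // true: 49 code points
console.log(fitsCodePointLimit(family.repeat(8), 50));  // false: 56 code points
```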

8. A Practical Checklist for 2026

What to actually do

Use Intl.Segmenter for any user-facing character count or limit in the browser
Normalise to NFC before storing any text that users type (names, usernames), and before hashing passwords
Use utf8mb4 in MySQL — there is no reason to use utf8 in 2026
In Python, use the regex module (not re) for \X grapheme cluster matching
Test your forms with: é (precomposed), नमस्ते (Devanagari), 👨‍👩‍👧‍👦 (ZWJ sequence)
Never truncate strings at a byte boundary — always at a grapheme cluster boundary
When slicing strings for display (e.g. "first 50 chars"), use Intl.Segmenter to segment, then take the first N segments
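
The last two items can be sketched in a few lines. truncateGraphemes is a hypothetical helper; the trailing ellipsis is a presentation choice made for the example, not a requirement.

```javascript
// Keep at most `max` grapheme clusters, never cutting one in half.
function truncateGraphemes(str, max, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let out = "";
  let kept = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (kept === max) return out + "\u2026"; // something was cut: mark it with an ellipsis
    out += segment;
    kept += 1;
  }
  return out; // the whole string fit
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const text = "ab" + family + "cd";

console.log(truncateGraphemes(text, 3) === "ab" + family + "\u2026"); // true: emoji kept whole
console.log(text.slice(0, 3) === "ab\uD83D");                          // true: slice leaves a lone surrogate
```

The slice result ends in half a surrogate pair, which typically renders as a replacement glyph and can break downstream encoders.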

9. Why This Is More Than a Niche Problem

India has 22 scheduled languages and hundreds of millions of smartphone users whose names, cities, and content exist natively in scripts that are not ASCII. The WhatsApp groups that run this country's commerce are full of emoji. If you are building for Indian users — or any global audience — and you have not thought about this, you have real bugs in production right now.

The fix is not hard. Intl.Segmenter is a one-liner. NFC normalisation is a one-liner. The awareness, though, requires understanding why "hello".length has always been a simplification that breaks the moment you step outside ASCII.
