Handle Unicode in URLs: Slugs for Non-English Content
Creating URL slugs for non-English content raises questions that do not arise with English text: should you keep accented characters in Spanish or French slugs, or convert them to ASCII? How do you create slugs from Chinese, Japanese, or Korean text? What happens to Arabic or Hebrew in URLs? This guide addresses all of these scenarios, explains the technical realities of Unicode in URLs, and shows how a slug generator can handle these transformations correctly.
How URLs Handle Non-ASCII Characters
URLs were originally designed for ASCII, the 128 characters covering English letters, digits, and common punctuation. As the web expanded globally, the standards were extended to carry Unicode through percent-encoding (also called URL encoding). A non-ASCII character is first encoded as UTF-8, and each resulting byte is written as a percent sign followed by two hexadecimal digits. The Spanish word información becomes informaci%C3%B3n in a URL: the single character ó is two UTF-8 bytes, C3 and B3. This is technically valid and functional, since browsers decode it and display the original text, but the encoded form is visually unwieldy, difficult to type, hard to share in print, and prone to copy-paste errors that break the encoding.
Modern browsers display Internationalized Domain Names (IDN) and Unicode URL paths in their decoded form. If you visit es.example.com/seo/informaci%C3%B3n, most modern browsers show es.example.com/seo/información in the address bar. However, the URL actually transmitted in HTTP requests is the percent-encoded version, and some systems, proxies, and link processors work with the encoded form.
Google's crawlers handle percent-encoded URLs correctly and recognize that informaci%C3%B3n and información refer to the same path. Google has also documented that it can index non-ASCII URLs and that Unicode slugs in the local language can be appropriate for locally targeted content. The practical question is therefore not whether Unicode URLs are valid (they are) but whether they are the best choice for your use case, given the tradeoffs between local search relevance, shareability, and technical compatibility.
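The encoding step described above can be reproduced with Python's standard library. This is a minimal sketch: the path is the article's example, and the helper name `same_path` is made up for illustration.

```python
from urllib.parse import quote, unquote

# "\u00f3" is the letter o with an acute accent as a single code point.
path = "/seo/informaci\u00f3n"

# quote() encodes the text as UTF-8 and writes each non-ASCII byte as %HH.
encoded = quote(path)
print(encoded)  # /seo/informaci%C3%B3n

# unquote() reverses the process, which is what a browser does for display.
decoded = unquote(encoded)
print(decoded == path)  # True

def same_path(a: str, b: str) -> bool:
    """Hypothetical helper: compare two paths by their decoded form."""
    return unquote(a) == unquote(b)

print(same_path("/seo/informaci%C3%B3n", path))  # True
```

Comparing decoded forms is one simple way to treat the encoded and displayed versions of a path as the same resource.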
Accented Characters: Transliterate or Keep?
For languages that use the Latin alphabet with diacritical marks (Spanish, French, Portuguese, German, Polish, Czech, Romanian, and many others) you have two viable options for URL slugs.
Option 1: Transliterate (remove accents). Convert é to e, ü to u, ñ to n, ç to c, ø to o, ł to l. This produces ASCII-only slugs that are safe in all systems, never subject to encoding errors, and readable in any context. The downside is that transliterated slugs are slightly less natural in the target language: información becomes informacion, which is readable but misspelled by the rules of Spanish.
Option 2: Keep native Unicode characters. Use the Unicode slug as written, información with the accented ó, and let it be percent-encoded on the wire. This is more authentic in the target language and can improve local search matching for users typing Spanish queries with accents.
The SEO argument for keeping native characters is that Spanish-speaking users searching Google in Spain or Latin America often type with accents, and a URL that matches the accented form may have a slight relevance edge for those queries. However, Google is very good at query normalization and maps accented and non-accented versions of the same word to the same results in most cases.
The practical argument for transliteration is operational simplicity. ASCII-only slugs never encounter encoding errors and work in email links, print materials, SMS messages, and other link-handling systems without conversion. For content teams, avoiding accents in slugs eliminates a class of bugs entirely.
For most sites, transliteration is the safer and more maintainable choice. For sites specifically targeting local audiences in countries where accents are standard, and particularly for sites whose brand identity is strongly tied to the native language, keeping native characters is a defensible approach.
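Option 1 can be sketched in a few lines of standard-library Python. The function name `ascii_slug` and the `EXTRA_MAP` table are assumptions made for this example; the table covers only a handful of letters and would need extending for real content.

```python
import re
import unicodedata

# Letters such as "ø" and "ł" have no Unicode decomposition, so stripping
# combining marks alone leaves them untouched. They need an explicit mapping.
# Illustrative, not exhaustive.
EXTRA_MAP = str.maketrans({"ø": "o", "ł": "l", "ß": "ss", "æ": "ae"})

def ascii_slug(title: str) -> str:
    """Transliterate Latin-script text to a lowercase, hyphenated ASCII slug."""
    text = title.lower().translate(EXTRA_MAP)
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII is dropped rather than guessed at.
    ascii_text = stripped.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumerics into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text).strip("-")

print(ascii_slug("Información útil"))  # informacion-util
print(ascii_slug("Smørrebrød"))        # smorrebrod
```

The explicit table is the design point here: accent stripping by normalization handles é, ü, ñ, and ç, but not letters that Unicode treats as distinct base characters.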
CJK and Non-Latin Scripts in URL Slugs
Chinese, Japanese, and Korean (CJK) text, as well as Arabic, Hebrew, Thai, Devanagari, and other non-Latin scripts, require a different approach because removing accents does not apply: there is no simple character-by-character mapping to Latin letters.
For CJK content, the two main options are romanized transliteration (Pinyin for Chinese, Romaji for Japanese, Revised Romanization for Korean) or native Unicode characters, percent-encoded. Romanized transliteration produces ASCII slugs that work universally but lose the original script. For a Chinese blog post about SEO, the title SEO优化技巧 might produce the slug seo-youhua-jiqiao (using Pinyin romanization). This is readable to users who know Pinyin, searchable, and technically clean. Native Unicode slugs for CJK content, such as /seo/%E4%BC%98%E5%8C%96%E6%8A%80%E5%B7%A7 for /seo/优化技巧, are ugly in percent-encoded form but display correctly in modern browsers. Search engines with a strong presence in China, Japan, and Korea (Baidu, Yahoo Japan, Naver) can index these URLs and may provide a local search advantage.
For Arabic and Hebrew, which are right-to-left scripts, URL slugs present additional display complexity. Native Arabic URLs display correctly in modern browsers, but their percent-encoded form is practically impossible to read or type when shared. Romanized transliteration is the common practical choice for Arabic content on internationally accessible sites.
For content targeting predominantly local audiences on platforms with strong local search engines, native script URLs may be the right choice. For content targeting a global or multilingual audience, romanized transliteration provides the best balance of accessibility and search relevance. The Slug Generator tool handles Unicode input by either stripping or transliterating non-ASCII characters to produce clean, ASCII-compatible slugs, which is the safest default for most multilingual deployments.
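For the native-script option, a slug can keep letters from any script and still be percent-encoded for transport. The sketch below uses only the standard library; `native_slug` is a hypothetical name. Romanization (Pinyin, Romaji, and so on) needs a language-specific dictionary or library and is not shown.

```python
import unicodedata
from urllib.parse import quote

def native_slug(title: str) -> str:
    """Keep letters, digits, and combining marks from any script;
    turn everything else into word breaks joined by hyphens."""
    text = unicodedata.normalize("NFC", title).lower()
    kept = []
    for ch in text:
        # General categories: L* = letters, N* = numbers, M* = marks
        # (marks matter for scripts such as Devanagari and Thai).
        if unicodedata.category(ch)[0] in ("L", "N", "M"):
            kept.append(ch)
        else:
            kept.append(" ")
    return "-".join("".join(kept).split())

slug = native_slug("SEO 优化技巧!")
print(slug)         # seo-优化技巧
print(quote(slug))  # seo-%E4%BC%98%E5%8C%96%E6%8A%80%E5%B7%A7
```

The first printed line is what a modern browser would display; the second is the form that travels in the HTTP request.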
Internationalization Workflow for URL Slugs
Managing URL slugs for multilingual sites requires a systematic workflow to maintain consistency and avoid common internationalization errors.
Define your Unicode handling policy before launching the site. Decide whether each language will use transliterated ASCII slugs or native Unicode slugs, and document this decision in your site's style guide. Changing the policy after launch means migrating URLs with 301 redirects.
For sites with multiple languages, each language version should have slugs in its own language (or transliterated from its own language), not English slugs applied across all versions. Using English slugs for Spanish pages, such as es.example.com/create-seo-slugs instead of es.example.com/crear-slugs-seo, reduces local search relevance.
Use hreflang tags to explicitly connect corresponding pages in different languages. The hreflang annotation tells Google which language and region each URL targets, so that it can treat your Spanish and English pages as alternates of one another and show each searcher the matching version.
For CMS platforms, configure slug generation to use the localized title as the slug source. WordPress with WPML or Polylang, for example, can generate slugs from translated post titles rather than the English original. Configure the transliteration or Unicode handling at the plugin level to ensure consistent behavior across all content.
Test each language's URL slugs across different browsers, operating systems, and link-sharing platforms. Copy a Unicode URL from your browser address bar and paste it into an email, a Slack message, and a social media post. Check that it appears correctly in each context. If any context breaks the encoding or makes the URL unreadable, consider switching to transliterated ASCII slugs for that language.
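The hreflang step can be illustrated by generating the link elements from a map of language versions. In this sketch the `ALTERNATES` table, the function name `hreflang_tags`, the `en.example.com` host, and the `https` scheme are all assumptions; only the two paths come from the example above.

```python
from html import escape

# Hypothetical map: language code -> URL of that language's version of a page.
ALTERNATES = {
    "en": "https://en.example.com/create-seo-slugs",
    "es": "https://es.example.com/crear-slugs-seo",
}

def hreflang_tags(alternates: dict) -> list:
    """Render one <link rel="alternate"> element per language version."""
    return [
        f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}" />'
        for lang, url in sorted(alternates.items())
    ]

for tag in hreflang_tags(ALTERNATES):
    print(tag)
```

The same set of tags would normally go in the head of every language version of the page, so that each version points to all the alternates, itself included.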
Frequently Asked Questions
- Does Google prefer native-language slugs over transliterated ASCII slugs?
- Google has stated that it can index both native Unicode URLs and ASCII-transliterated URLs and that neither approach is categorically preferred for ranking. For local search relevance, native-language slugs may have a slight advantage for queries typed with the same characters, since Google can match the URL words exactly. However, Google's query normalization handles character variations well in most cases. The choice should be driven by operational practicality and your audience's search behavior rather than a clear Google preference.
- What happens when someone copies and pastes a Unicode URL?
- When a Unicode URL is copied from a browser address bar, the clipboard may receive either the decoded display form (with actual Unicode characters) or the percent-encoded form, depending on the browser and its settings. Either form usually works when pasted into modern applications (browsers, email clients, chat apps), because the receiving application encodes or decodes it as needed. However, some older applications, CMS fields, and server configurations may not handle the conversion correctly, resulting in broken links or 404 errors. This is one of the main reasons teams choose ASCII-transliterated slugs: there is nothing to encode, so this class of failure does not arise.
- How should I create slugs for content in right-to-left languages like Arabic or Hebrew?
- For Arabic and Hebrew content, the most practical approach for URL slugs is romanized transliteration. Transliterate the Arabic or Hebrew words into Latin characters using a consistent romanization standard, then apply standard slug formatting: lowercase, hyphens between words, no special characters. While native Arabic or Hebrew Unicode slugs are technically valid and display correctly in modern browsers, they are impractical for sharing in print, messaging apps, and contexts where right-to-left Unicode character sequences can cause display or encoding issues. Transliterated slugs are universally compatible and remain the standard for Arabic and Hebrew content on globally accessible sites.