Why normalise alef forms?

Arabic has several alef variants (أ إ آ ٱ) that users type inconsistently, often dropping the hamza entirely. Folding them all to the bare alef ا lets a search for احمد match أحمد, إحمد, and آحمد regardless of which form was typed.

Which characters does it fold?

It maps alef with hamza above, alef with hamza below, alef with madda, alef wasla, and the wavy-hamza and high-hamza alefs to the plain alef ا. With the extra option on, it also folds alef-maksura ى to ya ي and ta-marbuta ة to ha ه, as most Arabic search stacks do.

Does it remove vowel marks?

Optionally. The strip-tashkeel toggle removes harakat and other combining diacritics so that vowelled and unvowelled spellings compare equal. Leave it off if you need to preserve fully vocalised text.

Should I normalise the index and the query the same way?

Yes. Matching only works if both sides pass through an identical pipeline. Fold the stored terms and each incoming query with the same options; otherwise a normalised index term will not match an un-normalised query.

Is anything uploaded?

No. The mapping table is bundled with the page and all conversion runs in your browser. Nothing you paste is sent anywhere.

What is the Arabic Alef Normalizer?

The Arabic Alef Normalizer converts every Arabic alef variant, including alef-hamza, alef-madda and alef-wasla, to plain alef ا for spelling-insensitive search normalisation, with optional ya/ta-marbuta folding and tashkeel stripping. It runs free in your browser on Gera Tools, with nothing uploaded.

Arabic Alef Normalizer

Name: Arabic Alef Normalizer
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Arabic alef comes in several forms that users type inconsistently, which breaks naive text matching. This normaliser folds every alef variant to the bare alef ا so that searches succeed regardless of which alef was typed.

How it works

The tool replaces each alef variant with the plain alef, then optionally applies the wider folding that search engines commonly use:

أ إ آ ٱ ٲ ٳ ٵ   ->  ا   (all alef variants to bare alef)
ى  ->  ي           (optional: alef-maksura to ya)
ة  ->  ه           (optional: ta-marbuta to ha)
+ optionally strip tashkeel (harakat / combining marks)

The first rule is the core: regardless of hamza, madda, or wasla, the result is one canonical alef. The extra ya/ta-marbuta folds and the tashkeel strip mirror what production Arabic search indexes do, so enabling them maximises recall.
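The rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's own code: the function name, the three flags, and the exact set of marks treated as tashkeel (U+064B to U+0652 plus the superscript alef U+0670) are assumptions made for the example.

```python
# Hypothetical sketch of the folding rules; names and flags are illustrative.
ALEF = "\u0627"  # ا bare alef

ALEF_VARIANTS = {
    "\u0622": ALEF,  # آ alef with madda above
    "\u0623": ALEF,  # أ alef with hamza above
    "\u0625": ALEF,  # إ alef with hamza below
    "\u0671": ALEF,  # ٱ alef wasla
    "\u0672": ALEF,  # ٲ alef with wavy hamza above
    "\u0673": ALEF,  # ٳ alef with wavy hamza below
    "\u0675": ALEF,  # ٵ high hamza alef
}

# Assumed tashkeel set: fathatan..sukun plus superscript alef.
TASHKEEL = [chr(c) for c in range(0x064B, 0x0653)] + ["\u0670"]


def normalise(text, fold_maksura=False, fold_ta_marbuta=False,
              strip_tashkeel=False):
    """Return a search key for `text` using the selected folds."""
    table = dict(ALEF_VARIANTS)
    if fold_maksura:
        table["\u0649"] = "\u064A"  # ى -> ي
    if fold_ta_marbuta:
        table["\u0629"] = "\u0647"  # ة -> ه
    if strip_tashkeel:
        for mark in TASHKEEL:
            table[mark] = None      # None deletes the character
    return text.translate(str.maketrans(table))
```

Calling normalise("أحمد") with the defaults applies only the core alef rule; the optional folds have to be switched on explicitly.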

Why this problem is surprisingly common

There are at least seven Unicode code points that visually represent an alef, and users produce different ones depending on their keyboard, device, and whether they are typing quickly or carefully. Consider the name Ahmad (أحمد): a trained typist on an Arabic keyboard will include the hamza above the alef, but a user typing a search query on a phone will almost always type the bare alef (احمد). Without normalisation, those two spellings are invisible to each other in a database.
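To see which code points are involved, a short Python snippet can print each one with its official Unicode name. The list below is the bare alef plus the seven variants this page folds; it is a sketch for inspection, not an exhaustive inventory of alef-like characters.

```python
import unicodedata

# Bare alef followed by the seven variants folded by the core rule.
ALEF_LOOKALIKES = "\u0627\u0622\u0623\u0625\u0671\u0672\u0673\u0675"

for ch in ALEF_LOOKALIKES:
    # Print the code point in U+XXXX form next to its Unicode name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```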

The problem compounds with names and words at the start of sentences, in titles, or after punctuation, because those are positions where hamza placement is more varied. Arabic OCR tools and speech-to-text systems also produce inconsistent alef forms depending on the engine.

Worked example

A product database stores the name إبراهيم with an alef with hamza below. A user searches for ابراهيم with a bare alef. Without normalisation: zero results. With normalisation on both sides: both become ابراهيم, and the match succeeds.

The same logic applies to آسيا (Asia, alef with madda) versus اسيا (bare alef), and to any word beginning with ٱ (alef wasla), such as the definite article forms in classical Arabic texts.
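The worked example can be reproduced directly. The snippet below is a minimal sketch that folds only four common alef variants; the table name FOLD_ALEF is made up for the illustration.

```python
# Minimal fold: آ أ إ ٱ -> ا (hypothetical table for this example only).
FOLD_ALEF = str.maketrans("\u0622\u0623\u0625\u0671", "\u0627" * 4)

stored = "\u0625\u0628\u0631\u0627\u0647\u064A\u0645"  # إبراهيم (hamza below)
query = "\u0627\u0628\u0631\u0627\u0647\u064A\u0645"   # ابراهيم (bare alef)

print(stored == query)                                            # False
print(stored.translate(FOLD_ALEF) == query.translate(FOLD_ALEF))  # True

asia_stored = "\u0622\u0633\u064A\u0627"  # آسيا (alef with madda)
asia_query = "\u0627\u0633\u064A\u0627"   # اسيا (bare alef)
print(asia_stored.translate(FOLD_ALEF) == asia_query)             # True
```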

When to enable the additional options

Alef-maksura to ya (ى → ي): Enable when matching Egyptian or mixed-dialect Arabic text. Alef-maksura appears at the end of many common words (على، إلى، متى) and users routinely confuse it with the standard ya when typing on mobile keyboards.

Ta-marbuta to ha (ة → ه): Enable for broad recall. Many users type ه in place of the feminine ending ة, so مدينة and مدينه, or رقبة (“neck”) and رقبه, are in practice two spellings of the same word, and this fold merges them. Disable it where the distinction carries meaning: a final ه can also be the pronoun suffix “his”, so كتابة (“writing”) and كتابه (“his book”) would collapse into a single key.

Strip tashkeel: Enable for any corpus that mixes vowelled and unvowelled text, such as a database that contains both Quranic verses and modern news articles. Disable it in applications that specifically need to distinguish fully vocalised words.
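The trade-off behind these options can be checked with two small experiments in Python. This is a sketch under assumptions: the table names are invented, and the harakat range used here (U+064B to U+0652) is one reasonable choice rather than the tool's confirmed list.

```python
# Hypothetical tables for illustration.
TA_MARBUTA_TO_HA = str.maketrans("\u0629", "\u0647")   # ة -> ه
HARAKAT = {c: None for c in range(0x064B, 0x0653)}     # delete fathatan..sukun

# Recall gained, precision lost: two different words share one key.
writing = "\u0643\u062A\u0627\u0628\u0629"   # كتابة "writing"
his_book = "\u0643\u062A\u0627\u0628\u0647"  # كتابه "his book"
print(writing == his_book)                              # False
print(writing.translate(TA_MARBUTA_TO_HA) == his_book)  # True

# Vowelled and unvowelled spellings compare equal once harakat are removed.
vowelled = "\u0645\u064E\u062F\u0652\u0631\u064E\u0633\u064E\u0629"  # مَدْرَسَة
plain = "\u0645\u062F\u0631\u0633\u0629"                              # مدرسة
print(vowelled.translate(HARAKAT) == plain)             # True
```

The first pair shows why the ta-marbuta fold is optional: it helps with misspelt feminine endings but also merges genuinely different words.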

Implementation note

Apply the normaliser to both your indexed content and each incoming query using identical settings. A mismatch, such as normalising stored terms but not queries, fails in the same way as no normalisation at all. Treat the output of this tool as the search key only, and keep the original text for display.
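One way to follow this advice is to key an index by the normalised form while storing the original strings as values. The sketch below assumes a simple in-memory dictionary; search_key, build_index, and lookup are hypothetical helper names, and only the core alef fold is applied.

```python
from collections import defaultdict

# Core alef fold only: آ أ إ ٱ ٲ ٳ ٵ -> ا
FOLD = str.maketrans(
    "\u0622\u0623\u0625\u0671\u0672\u0673\u0675", "\u0627" * 7
)


def search_key(text):
    """Single normalisation path shared by indexing and querying."""
    return text.translate(FOLD)


def build_index(names):
    index = defaultdict(list)
    for name in names:
        # Key by the folded form, keep the original spelling for display.
        index[search_key(name)].append(name)
    return index


def lookup(index, query):
    # The query goes through the same function as the stored terms.
    return index.get(search_key(query), [])


index = build_index(["\u0623\u062D\u0645\u062F",                # أحمد
                     "\u0625\u0628\u0631\u0627\u0647\u064A\u0645"])  # إبراهيم
print(lookup(index, "\u0627\u062D\u0645\u062F"))  # query typed as احمد
```

Because both sides call search_key, changing the fold options in one place keeps the index and the queries in step, although an existing index would need rebuilding after such a change.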