Arabic has a tightly constrained syllable structure: every syllable opens with exactly one consonant followed by a vowel, and may be closed by a coda consonant (two in the superheavy CVCC type). This analyzer parses fully vocalised Arabic into those syllables and labels each one by its CV pattern and weight.
How it works
The text is segmented character by character into consonants (C), short vowels (V), long vowels (VV), and codas:
- A consonant letter becomes an onset C.
- A short vowel (fatha, damma, kasra) is a V nucleus, unless it is followed by its matching madd letter (alef after fatha, waw after damma, ya after kasra), in which case the pair is a long VV nucleus.
- A sukun marks the preceding consonant as a coda, closing the syllable.
- A shadda (gemination) doubles its consonant: the first copy is a coda closing the previous syllable, the second is the onset of the next.
- Tanwin is read as a short vowel plus a final /n/ coda.
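The rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the analyzer's actual code: the function name `segment` and the tag strings `"C"`, `"V"`, `"VV"` and `"CODA"` are hypothetical. It also makes three assumptions the rules do not spell out: a madd letter counts as a vowel lengthener only when it carries no mark of its own, a letter with no vowel mark at all is treated as a coda, and the input is a single word.

```python
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
SUKUN, SHADDA = "\u0652", "\u0651"
TANWIN = {"\u064B", "\u064C", "\u064D"}  # fathatan, dammatan, kasratan
# Each short vowel and the madd letter that lengthens it (alef, waw, ya).
MADD = {FATHA: "\u0627", DAMMA: "\u0648", KASRA: "\u064A"}
MARKS = set(MADD) | TANWIN | {SUKUN, SHADDA}


def segment(word):
    """Return a list of tags ("C", "V", "VV", "CODA") for one vocalised word."""
    # Group each base letter with the marks that follow it, so the result
    # does not depend on whether shadda is stored before or after the vowel.
    groups = []
    for ch in word:
        if ch in MARKS:
            if groups:
                groups[-1][1].add(ch)
        else:
            groups.append((ch, set()))

    tokens = []
    i = 0
    while i < len(groups):
        marks = groups[i][1]
        vowel = next((m for m in marks if m in MADD), None)
        has_tanwin = bool(marks & TANWIN)
        if SHADDA in marks:
            tokens.append("CODA")  # first half of the geminate
        if vowel is None and not has_tanwin:
            tokens.append("CODA")  # sukun, or no vowel mark at all (assumption)
        else:
            tokens.append("C")  # onset
            if vowel is None:
                tokens += ["V", "CODA"]  # tanwin: short vowel plus final /n/
            else:
                nxt = groups[i + 1] if i + 1 < len(groups) else None
                if nxt is not None and nxt[0] == MADD[vowel] and not nxt[1]:
                    tokens.append("VV")  # vowel plus bare madd letter
                    i += 1  # the madd letter is consumed by the nucleus
                else:
                    tokens.append("V")
        i += 1
    return tokens


# segment("كَتَبَ") gives ["C", "V", "C", "V", "C", "V"]
```

Grouping marks by letter first is a design choice: Unicode normalisation can reorder a shadda and its vowel, and a set per letter makes both orders behave the same.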
Syllables are then built greedily as C + nucleus + optional coda. Weight is classified as light (CV), heavy (CVC, CVV), or superheavy (CVVC, CVCC).
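A minimal sketch of the greedy step and the weight lookup follows. The names `syllabify`, `weight` and `WEIGHT`, and the tag strings `"C"`, `"V"`, `"VV"` and `"CODA"`, are hypothetical and stand in for whatever the analyzer uses internally. The sketch assumes every run of codas after a nucleus belongs to that syllable, which is how a CVCC pattern can arise.

```python
WEIGHT = {
    "CV": "light",
    "CVC": "heavy",
    "CVV": "heavy",
    "CVVC": "superheavy",
    "CVCC": "superheavy",
}


def syllabify(tokens):
    """Greedily group tags into CV patterns: onset, nucleus, then any codas."""
    syllables = []
    i = 0
    while i < len(tokens):
        # Every syllable must start with an onset followed by a nucleus.
        if tokens[i] != "C" or i + 1 >= len(tokens) or tokens[i + 1] not in ("V", "VV"):
            raise ValueError(f"no onset + nucleus at token {i}")
        pattern = "C" + tokens[i + 1]
        i += 2
        # Attach every coda that follows before the next onset.
        while i < len(tokens) and tokens[i] == "CODA":
            pattern += "C"
            i += 1
        syllables.append(pattern)
    return syllables


def weight(pattern):
    """Map a CV pattern to its weight class; unlisted patterns are 'unknown'."""
    return WEIGHT.get(pattern, "unknown")


# Tags for a CVC · CV · CV word such as darsahu:
tags = ["C", "V", "CODA", "C", "V", "C", "V"]
patterns = syllabify(tags)              # ["CVC", "CV", "CV"]
weights = [weight(p) for p in patterns]  # ["heavy", "light", "light"]
```

Raising an error on a missing onset or nucleus keeps malformed or unvocalised input from being silently mis-syllabified.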
Example
The verb:
كَتَبَ (kataba)
parses as CV · CV · CV (ka · ta · ba): three light open syllables. The word:
دَرْسَهُ (darsahu)
parses as CVC · CV · CV (dar · sa · hu), where the sukun on the rāʾ closes the first syllable into a heavy CVC.
Notes
- Always vocalise the text; without harakat the short vowels, sukun, and shadda are not written, so syllable boundaries cannot be recovered from the letters alone.
- CVVC and CVCC superheavy syllables normally occur only at the end of a word.
- This is a phonological approximation for linguistics and teaching, not a complete prosodic engine.