Counting Urdu words with a generic word counter gives wrong results because of one quirk: compound words are stitched together with an invisible zero-width non-joiner (ZWNJ) rather than a space. This counter understands that convention and counts words the way a reader actually sees them.
How it works
Before counting, every zero-width non-joiner (U+200C) is stripped from the text,
so a compound such as کتابخانہ collapses to a single token instead of splitting
at the non-joiner. The cleaned text is then split on a class of separators:
whitespace + ۔ ، ؛ ؟ + . , ; : ! ? ( ) + quotes
Empty tokens from consecutive separators are discarded. Sentences are counted by splitting on the Urdu full stop ۔ plus question and exclamation marks. Character counts are reported with and without whitespace.
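The steps above can be sketched in Python as follows. This is a minimal illustration rather than the tool's actual code: the function name count_urdu, the exact set of quote characters, and the choice to report character counts on the cleaned text (after the non-joiner is removed) are all assumptions.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner

# Separator characters taken from the description above. The quote
# characters are an assumption; the description only says "quotes".
SEPARATOR_CHARS = (
    "\u06D4\u060C\u061B\u061F"      # Urdu full stop, comma, semicolon, question mark
    + ".,;:!?()"                     # Latin punctuation and parentheses
    + "\"'\u201C\u201D\u2018\u2019"  # straight and curly quotes (assumed set)
)
WORD_SEPARATORS = re.compile(r"[\s" + re.escape(SEPARATOR_CHARS) + r"]+")

# Sentence terminators: Urdu full stop, Arabic and Latin question marks,
# and the exclamation mark.
SENTENCE_ENDS = re.compile("[\u06D4\u061F?!]+")


def count_urdu(text: str) -> dict:
    """Hypothetical helper: count words, sentences, and characters."""
    # Step 1: strip every ZWNJ so compounds stay in one piece.
    cleaned = text.replace(ZWNJ, "")
    # Step 2: split on the separator class and drop empty tokens.
    words = [t for t in WORD_SEPARATORS.split(cleaned) if t]
    # Step 3: split on sentence terminators and drop blank pieces.
    sentences = [s for s in SENTENCE_ENDS.split(cleaned) if s.strip()]
    return {
        "words": len(words),
        "sentences": len(sentences),
        # Assumption: character counts are taken on the cleaned text.
        "chars": len(cleaned),
        "chars_no_space": len(re.sub(r"\s", "", cleaned)),
    }


sample = "یہ ایک کتاب" + ZWNJ + "خانہ ہے\u06D4 آپ کیسے ہیں\u061F"
result = count_urdu(sample)
print(result["words"], result["sentences"])  # 7 2
```

Building the separator class with re.escape keeps the punctuation list readable and avoids hand-escaping characters that are special inside a regular expression.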
Example and notes
The line یہ ایک کتابخانہ ہے۔ آپ کیسے ہیں؟ counts as seven words and two
sentences; the zero-width non-joiner inside کتابخانہ does not inflate the total.
Note that the non-joiner has no width of its own, so two pieces of text that look
alike can have different naive word counts; this tool normalises that difference
away.
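The effect described in this note can be checked with a short experiment. The sketch below assumes a naive counter that tokenises with the regular expression \w+, which is one common approach and not necessarily what any particular generic counter does; the helper names naive_count and normalised_count are hypothetical.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner

first, second = "کتاب", "خانہ"
with_zwnj = first + ZWNJ + second   # compound written with the non-joiner
without_zwnj = first + second       # same letters, no non-joiner


def naive_count(text: str) -> int:
    # \w does not match U+200C, so the run of letters breaks there.
    return len(re.findall(r"\w+", text))


def normalised_count(text: str) -> int:
    # Strip the non-joiner first, then count the same way.
    return naive_count(text.replace(ZWNJ, ""))


print(naive_count(with_zwnj), naive_count(without_zwnj))            # 2 1
print(normalised_count(with_zwnj), normalised_count(without_zwnj))  # 1 1
```

The two strings differ by a single invisible code point, yet the naive counts disagree; once the non-joiner is stripped, both count as one word.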