Vocabug
docs
Version: 0.2.0
Contents
- About Vocabug
- Interface
- Using comments
- About graphemes
- Categories
- Building words
- Default distributions
- Assigning weights
- Alphabetisation
- Defining graphemes
- Transform
- The change
- Insertion and deletion
- The condition
- The exception
- Using categories
- Alternator and Optionalator
- Cluster-field
- Wildcard, repetition and positioning
- Advanced rules
- Questions and answers
1 About Vocabug
This is the complete documentation for Vocabug, version: 0.2.0. Vocabug randomly generates vocabulary from a given definition of graphemes and word patterns. It can be used to generate words for a constructed language, original nicknames or passwords, or just for fun.
This word generator is designed to be a successor to the Williams' Lexifer and to the legendary Awkwords. You can find Vocabug's repository here. If you want a "modern" user interface, albeit with limited features, check out Vocabug-lite.
2 Interface
- The textbox at the top of the program is the definition-build editor. A definition-build defines the graphemes, frequencies, word-shapes and transforms that generate the final words. There will already be a default definition-build in the definition-build editor, or the previous definition-build that generated words
- Use the Generate button to see Vocabug produce words
- Use the Copy button to copy the words to the clipboard
- Use the Clear button to clear the definition-build editor and the generated words
2.1 Options
- Use the Number of words textbox to choose the number of words to generate. The default number is 100
- Word-list mode will produce a list of words
- Paragraph mode will produce words that look vaguely like sentences by injecting punctuation into the word list and capitalising the first word of each sentence
- Debug mode will show, line by line, each step in creating each word
- Editor wrap lines will make the definition-build editor wrap onto the next line when a line exceeds the width of the editor
- Show keyboard will reveal a 'keyboard', a character selector, below the editor
- Remove duplicates will make sure all words generated are unique
- Force words will force the generator to try to generate the complete number of words requested within 30 seconds, regardless of how many words were rejected or removed as duplicates
- Sort words and Capitalise words should be self-explanatory
- The Word divider textbox sets the delimiter, or in other words, what the content will be between each word in the output. It is a space by default. Use \n to get one word for each line
2.2 File save / load
- Use the Save button to download the definition-build as a file called 'vocabug.txt', or whatever you named your file in the File name: field. The file is always a ".txt" type
- Use the Load button to load a file from your system into the definition-build editor
- Use the buttons in the Examples dropdown to load an example into the definition-build editor
3 Using comments
If a line contains a semicolon ;, everything after it on that line is ignored and not interpreted as Vocabug syntax, unless the ; is escaped. You can use this to leave notes about what something does or why you made certain decisions.
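The comment rule can be pictured with a short Python sketch. This is only an illustration of the behaviour described above, not Vocabug's own code; the function name strip_comment is hypothetical.

```python
def strip_comment(line: str) -> str:
    """Return the part of a line before the first unescaped semicolon."""
    kept = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            # A backslash and the character after it are kept verbatim,
            # so an escaped semicolon does not start a comment.
            kept.append(line[i:i + 2])
            i += 2
            continue
        if ch == ";":
            break  # Everything from here to the end of the line is a comment.
        kept.append(ch)
        i += 1
    return "".join(kept).rstrip()


print(strip_comment("C = t, n, k ; the consonants"))           # C = t, n, k
print(strip_comment(r"X = \;, a ; semicolon as a grapheme"))   # X = \;, a
```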
4 About graphemes
Graphemes are the indivisible, meaningful characters that make up a generated word in Vocabug. Phonemes can be thought of as graphemes. Taking the English words sky and shy as examples, sky is made up of the graphemes s + k + y, while shy is made up of sh + y.
4.1 Null grapheme
If a word is built using the syntax character ^ or ∅, it will disappear in the generated word. In other words, ^ is a null grapheme. If you want to use ^ as a grapheme, you will need to escape it. To use other syntax characters as graphemes, they must be escaped too.
4.2 Escaping characters
A single character following the syntax character \ loses any meaning it might have had in the generator; this includes backslashes themselves. This way, anything can be a grapheme, including capital letters that have already been defined as categories, brackets, and even spaces.
4.2.1 Word creation character escape
These are the characters you must escape if you want to use them in categories, segments and the words directive:
Characters | Meaning |
---|---|
; | Comment |
\ | Escapes a character after it |
C , D , K , ... | Any one-length capital letter can refer to a category |
$ | Defines a segment when followed by a capital letter |
, or a space | Separates choices |
* | Gives weight to an item |
[ and ] | Pick-one-set |
( and ) | Optional-set |
{ and } | Supra-set item |
^ or ∅ | A null grapheme |
4.2.2 Transform character escape
These are the characters you must escape if you want to use them in the transform block:
Characters | Meaning |
---|---|
; | Comment |
\ | Escapes a character after it |
> , -> , => , ⇒ or → | Indicates change |
, or a space | Separates choices |
[ and ] | Alternator-set |
( and ) | Optionalator-set |
C , D , K , ... | Any one-length capital letter can refer to a category |
^ or ∅ | Insertion when in TARGET , deletion when in RESULT |
^REJECT or ^R | Rejects a word |
/ | A condition follows this character |
? | A chance condition follows this character |
! | An exception follows this character |
_ | The underscore _ is a reference to the target |
# | Word boundary |
+ | Quantifier, matches as 1 or more of the previous grapheme |
+{ and } | Bounded quantifier |
: | Geminate-mark, matches exactly twice to the previous grapheme |
* | Wildcard, matches exactly 1 of any grapheme |
~ or … | Anythings-mark, matches 1 or more of any grapheme. It is non-greedy |
~{ or …{ and } | Blocked-anythings-mark |
< | Backreference |
\| | Engines are placed after this character, and a space |
5 Categories
A category is a set of graphemes with a key. The key is a singular-length capital letter. For example:
This creates three groups of graphemes. C is the group of all consonants, V is the group of all vowels, and F is the group of some of the consonants that will be used syllable-finally.
These graphemes are separated by commas; an alternative is to use spaces: C = t n k m ch l ꞌ s r d h w b y p g
By default, the graphemes' frequencies decrease as they go to the right, according to the Gusein-Zade distribution. You can change this distribution. In the above example, when Vocabug needs to choose a V, it will choose a the most at 43%, i the second-most at 26%, e the third-most at 17%, u the fourth-most at 10%, and o the fifth-most at 4%.
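The percentages quoted for V can be reproduced with one common formulation of the Gusein-Zade distribution, in which the item at a given rank out of n receives a weight proportional to ln((n + 1) / rank). The Python sketch below is an assumption about the arithmetic, not a copy of Vocabug's implementation; the function name is hypothetical.

```python
import math


def gusein_zade_weights(n: int) -> list[float]:
    """Normalised weights ln((n + 1) / rank) for ranks 1..n."""
    raw = [math.log((n + 1) / rank) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


vowels = ["a", "i", "e", "u", "o"]
for grapheme, weight in zip(vowels, gusein_zade_weights(len(vowels))):
    # Prints a: 43%, i: 26%, e: 17%, u: 10%, o: 4%
    print(f"{grapheme}: {weight:.0%}")
```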
Need more than 26 categories? Vocabug supports the following additional characters as the key of a category or segment: Á Ć É Ǵ Í Ḱ Ĺ Ḿ Ń Ó Ṕ Ŕ Ś Ú Ẃ Ý Ź À È Ì Ǹ Ò Ù Ẁ Ỳ Ǎ Č Ď Ě Ǧ Ȟ Ǐ Ǩ Ľ Ň Ǒ Ř Š Ť Ǔ Ž Ä Ë Ḧ Ï Ö Ü Ẅ Ẍ Ÿ Γ Δ Θ Λ Ξ Π Σ Φ Ψ Ω
5.1 Categories inside categories and set-categories
You can use categories inside categories, as long as the referenced category has previously been defined. For example:
In the example above, V has a 20% chance of being a long vowel.
You can also enclose a set of graphemes in square brackets [ and ]. This is called a 'set-category'. This set will be treated as if it were a reference to a category in terms of frequency. For example, we could write the same example as this:
V = a, i, e, o, [aa, ii, ee, oo]
Assigning weights to categories in categories and set-categories is possible.
Categories inside categories and set-categories CANNOT be a part of any sequence. For example, C = Xz or C = x[c, d] or C = [a, b][c, d] will not give the results you might want. To get sequence-like behaviour like that, you will need to use segments.
6 Building words
6.1 Words
The words: directive defines a set of 'word-shapes' that Vocabug will choose from to create words. A word-shape can consist of individual graphemes, categories, segments or a mixture of these.
By default, words are selected using the Zipf distribution. The first word-shape will be chosen the most often, then the second word-shape the second most often and so on. You can change this distribution. Below is a very simple example that will generate words with one to three CV syllables:
C = t, n, k, m, l, s, r, d, h, w, b, j, p, g
V = a, i, o, e, u
words: CV, CVCV, CVCVCV, V
Word-shapes may alternatively be declared in a BEGIN words: block. This allows word-shapes to be declared over multiple lines, with comments between word-shapes:
BEGIN words:
CV,
CVCV,
CVCVCV,
; This is a comment
V
END
You must use the END keyword on a new line to end the block.
6.2 Segments
Segments are a system that provides an abbreviation of parts of a word-shape. Typically you would use it to define the shape of a syllable. Segments are defined similarly to categories, but with important differences:
- Every segment's key starts with $. S = s is a category; $S = s is a segment.
- Segments are not sets like categories are. $M = a, b, c will not work as you might expect (because, as already stated, segments are abbreviations for parts of word-shapes). You would need to use a pick-one-set, i.e. $M = [a, b, c]
For example, you could write the last example like so:
C = t, n, k, m, l, s, r, d, h, w, b, j, p, g
V = a, i, o, e, u
$S = CV
words: $S $S$S $S$S$S
You can put segments inside segments.
6.3 Pick-one-set
A pick-one-set is a group of graphemes and categories separated by spaces or commas, enclosed in square brackets [ and ]. Vocabug will pick an option from that pick-one-set just like it would from a category. For example:
This will produce either ta, tu or tx.
Pick-one-sets can be nested inside each other.
Anything inside the pick-one can be assigned a weight, and a pick-one itself can be assigned a weight as well if it is nested inside another set:
[a*1, b*2, [c, d]*2]
6.4 Optional-set
An optional-set uses round brackets, ( and ), and works the same way as a pick-one-set. The only difference is that what's inside it can either appear in the word or not. The probability of it appearing is 10% by default.
In the above example, there is a 10% chance of getting one of tan, tat or tal, but a 90% chance of ta.
6.4.1 Optionals weight
By default, an optional-set has a 10% chance of being included in the word. You can change this probability with the optionals-weight: directive.
6.5 Supra-set
A 'supra-set' is applied over the entire word, and there can only be one supra-set. Curly brackets, { and }, denote each item in the supra-set and its location in the word. The items of a supra-set can only be a category, or the null grapheme ^. Only one item in the supra-set will be picked for each generated word.
Supra-set is a feature designed to help generate words with stress systems, pitch accent systems, or other word-based suprasegmentals. Here is an example where it is used for a stress system:
C = t
V = a
X = '
words: ({X}CV){X}CV
This produces any of the following words: 'ta, ta'ta, 'tata, but never any words with more than one '. Notice here that ta is not possible: a supra-set item is only chosen after dealing with any sets that the supra-set items are nested in.
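The order of operations described here (resolve the surrounding sets first, then pick exactly one supra-set item) can be sketched in Python. This is a hand-written model of the single word-shape ({X}CV){X}CV, assuming C contains only t, V only a, and X only '. It is not Vocabug's algorithm, and the function name is hypothetical.

```python
def expand(optional_included: bool) -> list[str]:
    """List every word the shape ({X}CV){X}CV can yield for one optional-set outcome."""
    # Step 1: resolve the optional-set. It decides how many CV syllables,
    # and therefore how many supra-set slots, the word has.
    syllables = 2 if optional_included else 1
    # Step 2: exactly one of the remaining supra-set slots receives the X grapheme.
    words = []
    for stressed in range(syllables):
        word = "".join(("'" if i == stressed else "") + "ta" for i in range(syllables))
        words.append(word)
    return words


all_words = sorted(expand(False) + expand(True))
print(all_words)  # ["'ta", "'tata", "ta'ta"]
```

Because one slot is always filled, a bare ta never appears in the output, and no word carries more than one '.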
See the "Romance-like" example for a language that uses supra-set for its stress system.
6.5.1 Supra-set weight
You can set the weights of supra-set items like so:
M = m
N = n
$X = ka{M*8}
$Y = te{N*2}
words: $X$Y
The above example has an 80% chance of generating kamte and a 20% chance of generating katen.
Supra-set item weights support a sentinel value, a 'super-heavy' value s. This s will ensure the supra-set item attached to this weight is always chosen over others. For example: {V*s}
7 Default distributions
The ordering of items matters in categories, segments and word-shapes. The first item will be chosen the most often, the second item the second most often, and so on.
You can change these default distributions (another name for this might be "default drop-off", but I digress). For categories, the default is gusein-zade and you change it with the category-distribution: directive. For the separate setting for word-shapes, the default is zipfian and you change it with the wordshape-distribution: directive. The distribution will be applied to each item in a set, and then recursively to any set that set is nested in (treating the nested set as an item), then applied at the surface level.
- A zipfian distribution approximates natural language frequency for words, where the highest-ranked item receives the greatest weight, and subsequent ones decay steeply until flattening out.
- A gusein-zade distribution offers a gentler slope that is natural across phonemes in a language, following a logarithmic decay that still prioritizes top-ranked items but spreads weight more evenly
- A shallow distribution is the red-headed step-child of the distributions. It doesn't occur in natural linguistics, but offers us something between Flat and Gusein-Zade. It is Zipfian in nature, a 'long-tailed Zipfian distribution'
- A flat distribution treats all items equally. This is not to say the items will be evenly chosen: items are still being randomly chosen on each generation, they just have the same weight
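To make the difference between a ranked and a flat distribution concrete, here is a minimal Python sketch. It assumes the textbook Zipf weighting of 1/rank; the exponent Vocabug actually uses, and the formula behind the shallow distribution, are not stated here, so treat the numbers as illustrative. The function names are hypothetical.

```python
import random
from collections import Counter


def zipfian_weights(n: int) -> list[float]:
    """Normalised 1/rank weights (the exponent of 1 is an assumption)."""
    raw = [1 / rank for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def flat_weights(n: int) -> list[float]:
    """Every item gets the same weight."""
    return [1 / n] * n


shapes = ["CV", "CVCV", "CVCVCV"]
rng = random.Random(0)  # seeded only so that a run is repeatable

for name, weights in [("zipfian", zipfian_weights(3)), ("flat", flat_weights(3))]:
    picks = Counter(rng.choices(shapes, weights=weights, k=1000))
    # Shows the weights as percentages, then how often each shape was drawn.
    print(name, [f"{w:.0%}" for w in weights], dict(picks))
```

Under the flat weights the three counts come out close to each other but not identical, which is the point made in the last bullet: equal weight does not mean an exactly even split.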

8 Assigning weights
If you want to set your own frequency for graphemes in a category or set-category, items in a pick-one-set or optional-set, or word-shapes in the words: directive, you can use an asterisk * to specify the weight for each item, like so:
V has approximately the following probabilities: a: 33%, e: 27%, i: 20%, o: 13%, u: 7%. The pick-one-set in the $S segment has an 80% chance of producing a V category over the x grapheme. And the first word-shape in the words: directive has twice the chance of being chosen over the next word-shape.
As you might have noticed in the example above, in a sequence that has at least one weighted option, it overwrites any default distributions. Also important to note is that any other option that you had not given a weight (inside that set, or on the surface level), is given a weight of 1.
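The arithmetic behind weights is simple normalisation: each weight is divided by the sum of all weights in the set, and items without a weight count as 1. The Python sketch below illustrates this. The exact weights behind the percentages quoted above are not shown here, so the values 5, 4, 3, 2, 1 and the set V*4, x are assumptions that happen to reproduce them; parse_weighted is a hypothetical helper, not part of Vocabug.

```python
def parse_weighted(items: str) -> dict[str, float]:
    """Turn a list such as 'a*5, e*4, x' into probabilities.

    Items written without *weight receive the default weight of 1.
    """
    weights = {}
    for item in items.split(","):
        name, _, weight = item.strip().partition("*")
        weights[name] = float(weight) if weight else 1.0
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


vowels = parse_weighted("a*5, e*4, i*3, o*2, u*1")
print({name: f"{p:.0%}" for name, p in vowels.items()})
# {'a': '33%', 'e': '27%', 'i': '20%', 'o': '13%', 'u': '7%'}

pick_one = parse_weighted("V*4, x")  # x is unweighted, so it counts as 1
print({name: f"{p:.0%}" for name, p in pick_one.items()})
# {'V': '80%', 'x': '20%'}
```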
9 Alphabetisation
The alphabet directive gives Vocabug a custom alphabetisation order for words, when the sort words checkbox is selected.
This would order generated words like so: cat chat cumin frog tray t'a yanny
9.1 Invisibility
Sometimes you will want characters, such as syllable dividers, to be invisible to alphabetisation. You can do this by listing these characters in the invisible: directive.
This would order generated words ˈpa.ta ˈca.ta za.ˈta ca.ˈa as ca.ˈa, ˈca.ta, ˈpa.ta, za.ˈta
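A custom alphabet with invisible characters amounts to a sort key. The Python sketch below is one way to build such a key, not Vocabug's implementation. Both alphabets used here are hypothetical, chosen only so that the two orderings quoted in this section come out; make_sort_key is likewise an illustrative name.

```python
def make_sort_key(alphabet, invisible=()):
    """Build a key function that ranks words by a custom alphabet."""
    # Try longer entries first so that a multigraph such as t' wins over t.
    letters = sorted(alphabet, key=len, reverse=True)
    rank = {letter: i for i, letter in enumerate(alphabet)}

    def key(word):
        # Invisible characters are removed before comparison.
        for ch in invisible:
            word = word.replace(ch, "")
        ranks, i = [], 0
        while i < len(word):
            for letter in letters:
                if word.startswith(letter, i):
                    ranks.append(rank[letter])
                    i += len(letter)
                    break
            else:
                # Characters missing from the alphabet sort after all listed letters.
                ranks.append(len(alphabet) + ord(word[i]))
                i += 1
        return ranks

    return key


alphabet = "a c f g h i m n o r t t' u y".split()  # hypothetical alphabet
words = ["yanny", "t'a", "tray", "frog", "cumin", "chat", "cat"]
print(sorted(words, key=make_sort_key(alphabet)))
# ['cat', 'chat', 'cumin', 'frog', 'tray', "t'a", 'yanny']

stressed = ["ˈpa.ta", "ˈca.ta", "za.ˈta", "ca.ˈa"]
key = make_sort_key(["a", "c", "p", "t", "z"], invisible=[".", "ˈ"])
print(sorted(stressed, key=key))
# ['ca.ˈa', 'ˈca.ta', 'ˈpa.ta', 'za.ˈta']
```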
10 Defining graphemes
The graphemes: directive tells Vocabug which (multi)graphs, including character + combining diacritics, are to be treated as grapheme units when using transformations.
In the above example, we defined ch as a grapheme. This would stop a rule such as c -> g changing the word chat into ghat, but it will still make cobra change into gobra.
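The effect of declaring ch as a grapheme can be modelled by splitting a word into units before a rule is applied. The Python sketch below uses greedy longest-match tokenisation, which is an assumption about how the units are found; the function names are hypothetical.

```python
def tokenise(word: str, graphemes: list[str]) -> list[str]:
    """Split a word into grapheme units, preferring the longest declared unit."""
    units = sorted(graphemes, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(word):
        # Fall back to a single character when no declared grapheme starts here.
        match = next((g for g in units if word.startswith(g, i)), word[i])
        tokens.append(match)
        i += len(match)
    return tokens


def apply_simple_rule(word: str, target: str, result: str, graphemes: list[str]) -> str:
    """Apply an unconditional TARGET -> RESULT rule to whole grapheme units only."""
    return "".join(result if t == target else t for t in tokenise(word, graphemes))


print(tokenise("chat", ["ch"]))                       # ['ch', 'a', 't']
print(apply_simple_rule("chat", "c", "g", ["ch"]))    # chat
print(apply_simple_rule("cobra", "c", "g", ["ch"]))   # gobra
```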
11 Transform
Once words are generated, you might want to modify them to prevent certain sequences, outright reject certain words, or simulate historical sound changes. This is the purpose of the transform block, which implements the NeSCA program.
All transforms must be used inside this block. To terminate this block you use an END line. However, all unterminated blocks are automatically terminated at the end of the definition-build:
BEGIN transform:
; Your rules go here
END
A rule can be summarised in four fields: CHANGE / CONDITION ! EXCEPTION. The characters / and ! that precede each field (except for the CHANGE) are necessary for signalling each field. For example, including a ! will signal that this rule contains an exception, and all text following it until the next field marker will be interpreted as such.
Every rule begins on a new line and must contain a CHANGE. The CONDITION or EXCEPTION fields are optional.
If you want to capture graphemes that are normally syntax characters in transforms, you will need to escape them.
When this document uses examples to explain transformations, the last comment shows an example word transforming. For example ; amda ==> ampa means the rule will transform the word amda into ampa
12 The change
The format of the change can be expressed as TARGET -> RESULT.
- TARGET specifies which part of the word is being changed
- Then followed by a space and the > character. > can be swapped with either ->, =>, ⇒ or → if you prefer
- RESULT is what TARGET is changing into, or in other words, what replaces it
Let's look at a simple unconditional rule:
o -> x
; bodido ==> bxdidx
In this rule, we see every instance of o become x.
12.1 Concurrent change
Concurrent change is achieved by listing multiple graphemes in TARGET separated by commas, and listing the same number of resultant graphemes in RESULT separated by commas. Changes in a concurrent change execute at the same time:
o, a -> a, o
; boda ==> bado
Notice that the above example is different to the example below:
o -> a
a -> o
; boda ==> bodo
where each change is on its own line. We can see o merge with a, then a becomes o.
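The difference between a concurrent change and two rules on separate lines can be shown in a few lines of Python. This is a sketch restricted to single-character graphemes; the function names are hypothetical.

```python
def concurrent_change(word: str, mapping: dict[str, str]) -> str:
    """Apply every change at once: each grapheme is looked up in the original word."""
    return "".join(mapping.get(ch, ch) for ch in word)


def sequential_change(word: str, rules: list[tuple[str, str]]) -> str:
    """Apply rules one after another: later rules see the output of earlier ones."""
    for target, result in rules:
        word = word.replace(target, result)
    return word


print(concurrent_change("boda", {"o": "a", "a": "o"}))      # bado
print(sequential_change("boda", [("o", "a"), ("a", "o")]))  # bodo
```

In the sequential version, the first rule turns boda into bada, and the second rule then turns every a, old or new, into o.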
12.2 Merging change
Instead of listing each RESULT in a concurrent change, we can instead list just one that all the TARGETs will merge into:
o, a -> x
; boda ==> bxdx
This is equivalent to:
o, a -> x, x
; boda ==> bxdx
12.3 Reject
To remove, or in other words, reject a word, you use the ^REJECT keyword in RESULT:
In the above example, any word that contains a or bi will be rejected.
A shorthand version of ^REJECT is ^R
13 Insertion and deletion
Insertion requires a condition to be present, and for a caret ^ to be present in TARGET, representing nothing.
Deletion happens when ^ is present in RESULT:
14 The condition
Conditions follow the change and are placed after a forward slash. When a transform has a condition, the target must meet the environment described in the condition to execute.
The format of a condition is / BEFORE_AFTER
- BEFORE is anything in the word before the target
- The underscore _ is a reference to the target in a condition
- AFTER is anything in the word after the target
For example:
o -> x / p_p
; opoptot ==> opxptot
14.1 Multiple conditions in one rule
Multiple conditions for a single rule can be made by separating each condition with additional forward slashes. The change will happen if it meets either, or both of the conditions:
o -> x / p_p / t_t
; opoptot ==> opxptxt
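A condition behaves much like regular-expression lookaround: BEFORE is checked behind the target and AFTER ahead of it, and neither is consumed. The Python sketch below models the two rules above that way. It is an analogy for single-character graphemes and literal conditions, not a description of how Vocabug compiles rules; apply_rule is a hypothetical name.

```python
import re


def apply_rule(word: str, target: str, result: str, conditions: list[str]) -> str:
    """Replace target with result wherever any BEFORE_AFTER condition holds."""
    patterns = []
    for condition in conditions:
        before, after = condition.split("_")
        lookbehind = f"(?<={re.escape(before)})" if before else ""
        lookahead = f"(?={re.escape(after)})" if after else ""
        patterns.append(lookbehind + re.escape(target) + lookahead)
    # Several conditions are alternatives: meeting any one of them is enough.
    return re.sub("|".join(patterns), result, word)


print(apply_rule("opoptot", "o", "x", ["p_p"]))         # opxptot
print(apply_rule("opoptot", "o", "x", ["p_p", "t_t"]))  # opxptxt
```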
14.2 Word boundary
# matches word boundaries: either the beginning of the word if it is in BEFORE, or the end of the word if it is in AFTER
; opoppop ==> opoppxp
14.3 The chance condition
The chance condition is placed following a ? as a number from 0 to 100. This number represents the percentage chance of the transformation occurring:
In the above example, the transformation will execute only 30% of the time.
15 The exception
Exceptions are placed following a ! and go after the condition, if there is one. Exceptions function exactly like the opposite of the condition: when a transform has an exception, the target must meet the environment described in the exception to prevent execution:
In the above example, the transformation will not execute if aa is at the end of the word.
16 Using categories
You can reference categories in transforms. The category will behave in the same way as an alternator set:
B = x, y
BEGIN transform:
B -> ^
; xapay ==> apa
If the category is part of a target, it MUST be inside an alternator set:
BEGIN transform:
[B]v -> ^
; xvapay ==> apay
17 Alternator and Optionalator
These are designed to be fussy: they cannot be nested, and they cannot stand on their own.
17.1 Alternator-set
Enclosed in square brackets, [ and ], only one item in an alternator-set will be part of each sequence. For example:
The above example is equivalent to:
These can also be used in exceptions and conditions.
17.2 Optionalator-set
Items in an optionalator, enclosed in ( and ), can be captured whether or not they appear as part of a grapheme or as part of a sequence of graphemes:
x(w) -> k
; xwaxaħa ==> kakaħa
An optionalator-set can also attach to an alternator-set:
[x, ħ](w) -> k
; xwaxaħa ==> kakaka
Optionalator-set cannot be used on its own, it must be connected to other content.
18 Cluster-field
Cluster-field is a way to target sequences of graphemes and change them. They are laid out as tables, and start with % followed by a space. The first part of a sequence is in the first column, and the second part is in the first row. For example:
- In this example, np becomes mp and mt becomes nt
- + means to not change the target cluster at all
- - means to reject the word if it contained that sequence
- Cluster-fields can use ^ or ∅ to delete the target sequence
- These are executed concurrently just like concurrent changes. Their order does not matter
- Cluster-fields can also use conditions and exceptions, just put them on their own line
19 Wildcards, repetition and positioning
Wildcards and the like in this section are special tokens that can represent arbitrary amounts of arbitrary graphemes, which is especially useful when you don't know precisely how many, or of what kind of grapheme there will be between two target graphemes in a word.
19.1 Quantifier
Quantifier, using +, will match once or as many times as possible to the grapheme to the left of it. Quantifier cannot be used in RESULT:
a+ -> o
; raraaaaa ==> roro
19.2 Bounded quantifier
The bounded quantifier matches as many times as its digit(s), enclosed in +{ and }, to the thing to its left.
o -> x / r+{3}_
; ororrro ==> ororrrx
The digits in the quantifier can also be a range:
o+{2,4} -> x
; tootooooo ==> txtxo
At the beginning of the list, , represents all the possible numbers lower than the number to the right, not including zero.
o+{,4} -> x
; tootooooo ==> txtx
And finally, at the end of the list, , represents all possible numbers larger than the number to the left
o+{4,} -> x
; toootooooo ==> toootx
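Readers who know regular expressions may find it helpful to compare the bounded quantifier with the {m,n} counted repetition. The Python sketch below reproduces three of the examples above with the standard re module, treating each letter as one grapheme. It is an analogy only and says nothing about how Vocabug evaluates these rules internally.

```python
import re

# o -> x / r+{3}_   : o is replaced only when exactly three r's precede it
after_three_r = re.sub(r"(?<=r{3})o", "x", "ororrro")

# o+{2,4} -> x      : a run of two to four o's collapses into one x
two_to_four = re.sub(r"o{2,4}", "x", "tootooooo")

# o+{4,} -> x       : a run of four or more o's collapses into one x
four_or_more = re.sub(r"o{4,}", "x", "toootooooo")

print(after_three_r)  # ororrrx
print(two_to_four)    # txtxo
print(four_or_more)   # toootx
```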
19.3 Geminate-mark
Geminate-mark, using colon :, will match twice to the grapheme, or grapheme from a set or category, to the left of it. In other words, you can capture an item only when it is geminated using the geminate-mark:
a: -> o
; aaata => oata
Unlike quantifier, a geminate-mark can be used in RESULT:
a -> a:
; tat => taat
19.4 Kleene-star
Occasionally, you may want to match a grapheme whether it is absent, occurs once, or occurs multiple times consecutively. This is known as a "Kleene-star". There is no dedicated character for a Kleene-star. Instead, you wrap the content, followed by a quantifier, in an optionalator:
; ruaruaaaaa ==> roro
19.5 Wildcard
Wildcard, using asterisk *, will match once to any grapheme. Wildcard does not match word boundaries. Wildcard cannot be used in RESULT:
; Any grapheme becomes /x/ when any grapheme follows it
* -> x / _*
; aomp ==> xxxp
Wildcard can be placed by itself inside an optionalator (*), thereby allowing it to match nothing as well.
19.6 Anythings-mark
The anythings-mark uses tilde ~ or the ellipsis character … (U+2026). It will match as many (but not zero) times to any grapheme as needed. For example:
b~t -> x
; babitto => xto
As we can see, the rule matched b followed by anything else until it reached the first t, then stopped matching. Why did the anythings-mark not continue matching t and beyond like *+ would? This is because it is non-greedy, or in other words, lazy. The anythings-mark will continue matching graphemes until the next grapheme matches the item following the anythings-mark.
The example below uses an optional anythings-mark in the condition:
a, i, u -> ã, ĩ, ũ / [ã, ĩ, ũ]~_
; pabãdruliga ==> pabãdrũlĩgã
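The lazy behaviour described for the babitto example has a direct counterpart in regular expressions, where .+? is the lazy form of .+. The Python sketch below contrasts the two using the standard re module. It is offered as an analogy, with each letter standing for one grapheme, rather than as Vocabug's implementation.

```python
import re

word = "babitto"

# Lazy: stop at the first t that allows a match, like the anythings-mark.
lazy = re.sub(r"b.+?t", "x", word)

# Greedy: run on to the last t that allows a match, like *+ would.
greedy = re.sub(r"b.+t", "x", word)

print(lazy)    # xto
print(greedy)  # xo
```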
19.7 Blocked-anythings-mark
Blocked-anythings-mark is designed to block the spreading behaviour of the anythings-mark when certain graphemes are ahead of it. You enclose a set of graphemes inside ~{ and } that will block spreading. For example, we might want the graphemes k or g to prevent the rightward spread of nasal vowels to non-nasal vowels:
; pabãdruliga ==> pabãdrũlĩga
19.8 Backreference
A backreference is a reference to the captured target. It cannot be used in TARGET. This uses the less-than symbol <.
Here are some prime examples where backreference is employed:
Full reduplication:
"Haplology":
Reject a word when a word-initial consonant is identical to the next consonant:
20 Advanced rules
20.1 Engine
The engine statement provides useful functions that you can call at any point in the definition-build. You call these engines following a | and a space on a new line. You can also call a list of these functions in one line. For example: | compose, capitalise
- decompose will break down all characters in a word into their "Unicode Normalization, Canonical Decomposition" form. For example, ñ as a singular Unicode entity, \u00F1, will be broken down into a sequence of two characters, n \u006E + ◌̃ \u0303
- compose does the opposite of decompose. It converts all characters in a word to the "Unicode Normalization, Canonical Decomposition followed by Canonical Composition" form. For example, ñ as two characters \u006E\u0303, will be transformed into one character, \u00F1
- capitalise will convert the first character of a word to uppercase
- decapitalise will convert the first character of a word to lowercase
- to-upper-case will convert all characters of a word to uppercase
- to-lower-case will convert all characters of a word to lowercase
- xsampa_to_ipa will convert graphemes of a word written in X-SAMPA into IPA
- ipa_to_xsampa will convert graphemes of a word written in IPA into X-SAMPA
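The two normalisation forms named for decompose and compose are the standard Unicode forms NFD and NFC. The Python sketch below shows what they do to ñ using the standard unicodedata module. It illustrates the Unicode behaviour only and makes no claim about how Vocabug calls it.

```python
import unicodedata

composed = "\u00F1"  # ñ as a single code point

# Canonical Decomposition (NFD): one code point becomes base letter + combining tilde.
decomposed = unicodedata.normalize("NFD", composed)
print([f"U+{ord(ch):04X}" for ch in decomposed])  # ['U+006E', 'U+0303']

# Canonical Decomposition followed by Canonical Composition (NFC): back to one code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)  # True
```

This matters for transforms because a rule written with the single code point will not match a word that holds the two-character sequence, and vice versa.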
21 Questions and answers
Here are some common questions and answers about Vocabug:
The Generate button is greyed out
This means Vocabug is busy generating words for you, and the button will eventually become clickable again. If you think this is taking too long, perhaps you have force word limit accidentally on.
I received the error "Invalid regular expression"
This error occurs because you are using Vocabug in an old browser or old browser version that does not support lookbehind. You can check if this applies to you here.
How do I target syllables or syllable division in transforms?
Vocabug does not have a built-in way to target syllables, but you can use a . character as a syllable divider like words: $S.$S, $S.$S.$S, and then reference it in transforms.
What is a natural frequency for consonants in a language?
There is no one-size-fits-all answer to this question, and different analyses of word lists may produce different data on what the general expectation is. For example, /ð/ is very uncommon across the English lexicon, yet it is a common phoneme in running text because of the prevalence of the words this, that, those and the. And indeed, morphology and historical sound changes can skew any initial control you might have over frequencies. The main takeaway here is that phonemes that are easy to pronounce and distinguish will be the most common.
However, a good rule of thumb is that the most common class will come from 'Class A', then 'Class B', and finally 'Class Z'.
- Class A: nasals, and voiceless plosives, with alveolar consonants in this class being much more common, such as t and n (an exception to this being Australian languages with their word-initial consonants). Certain phonemes such as ŋ may have a much lower frequency depending on the language. Certain alveolar phonemes that would belong in Class B such as s may belong in this class, depending on the language.
- Class B: The next most common class... is everything else not in Class A or Z. Some broad patterns across languages you might observe are g having a much lower frequency compared to the other voiced plosives, and affricates being particularly low in frequency inside this class.
- Think about which sounds you want to be rare in your conlang, that would drop off sharply. These sounds belong in 'Class Z'. Natural languages with very few consonants such as Hawaiian don't have this class. In English, these sounds are /θ/ and /ʒ/, which drop off sharply.
How do I weight an individual optional-set?
The optionals-weight: directive affects the weight of all optional-sets. As of version 1, there is no direct way to weight an individual optional-set. You can, however, use ^ as an item in a pick-one-set, like a[b, c, ^*3]