Vocabug-lite
docs
Version 0.0.7
Contents
- About Vocabug
- Interface
- About graphemes
- Categories
- Building words
- Default distributions
- Assigning weights
- Alphabetisation
- Transform
- The change
1 About Vocabug
This is the complete documentation for Vocabug-lite, version 0.0.7.
This is a word generator designed to be a successor to William Annis's Lexifer and to the legendary Awkwords. You can find its repository here. As the name implies, Vocabug-lite is the 'lite' version of the full Vocabug, which is yet to be released.
Vocabug randomly generates vocabulary from a given definition of graphemes, frequencies and word patterns. You can use it to make words for a constructed language, to get an original nickname or password, or just for fun.
2 Interface
- Use the Generate button to see Vocabug produce words. If this button is greyed out, it means that Vocabug is busy generating words
- Use the Copy button to copy the words to the clipboard
- Use the Clear button to clear all fields and generated words
2.1 Options
- Use the Number of words textbox to choose the number of words to generate. The default number is 100
- Word-list mode will produce a list of words
- Paragraph mode will produce words that look vaguely like sentences by injecting punctuation into the word list and capitalising the first word of each sentence
- Debug mode will show, line by line, each step in creating each word
- Remove duplicates will make sure all words generated are unique
- Sort words and Capitalise words should be self-explanatory
- The Word divider textbox sets the delimiter, or in other words, what will appear between each word in the output. It is a space by default. Use `\n` to get one word for each line
2.2 File save / load
- Use the Save button to download the definition-build as a file called 'vocabug.txt', or whatever you named your file in the File name: field. The file is always a ".txt" type
- Use the Load button to load a file from your system into the definition-build editor
3 About graphemes
Graphemes are the indivisible, meaningful characters that make up a generated word. Phonemes can be thought of as graphemes. To illustrate, take the English words `sky` and `shy`: `sky` is made up of the graphemes `s` + `k` + `y`, while `shy` is made up of `sh` + `y`.
3.1 Null grapheme
If a word is built using a caret `^`, the caret(s) will disappear in the generated word. In other words, `^` is a null grapheme. If you want to use `^` as a grapheme, you will need to escape it. To use other syntax characters as graphemes, they must be escaped too.
3.2 Escaping characters
A single character following the syntax character `\` loses any meaning it might have had in the generator; this applies to backslashes themselves too. This way, anything can be a grapheme: capital letters that have already been defined as categories, brackets, even spaces.
3.2.1 Word creation character escape
These are the characters you must escape if you want to use them in categories, segments and the words directive:
Characters | Meaning |
---|---|
`C`, `D`, `K`, ... | Any one-length character can refer to a category |
`$` | Defines a segment when followed by a capital letter |
`,` | Separates choices |
(space) | Separates choices. An alternative to commas |
`*` | Gives weight to an item |
`[`, `]` | Pick-one-set |
`(`, `)` | Optional-set |
`{`, `}` | Supra-set item |
`^`, `∅` | A null grapheme |
`\` | Escapes the character after it |
3.2.2 Transform character escape
These are the characters you must escape if you want to use them in the transform block:
Characters | Meaning |
---|---|
`#` | Word boundary |
`^`, `∅` | Deletion when in RESULT |
`\` | Escapes the character after it |
4 Categories
A category is a set of graphemes with a key. The key is a single capital letter. For example:
C = t, n, k, m, ch, l, ꞌ, s, r, d, h, w, b, y, p, g
F = n, l, ꞌ, t, k, r, p
V = a, i, e, u, o
This creates three groups of graphemes. `C` is the group of all consonants, `V` is the group of all vowels, and `F` is the group of some of the consonants that will be used syllable-finally.
These graphemes are separated by commas; an alternative is to use spaces: `C = t n k m ch l ꞌ s r d h w b y p g`.
By default, the graphemes' frequencies decrease as they go to the right, according to the Gusein-Zade distribution. In the above example, when Vocabug needs to choose a `V`, it will choose `a` the most at 43%, `i` the second-most at 26%, `e` the third-most at 17%, `u` the fourth-most at 10%, and `o` the fifth-most at 4%.
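The percentages quoted here can be reproduced with a small sketch. It assumes the common form of the Gusein-Zade distribution, where the item at rank r out of n items gets the weight ln((n + 1) / r). The function name `gusein_zade_shares` is made up for illustration, and this is not necessarily how Vocabug computes the distribution internally.

```python
import math

def gusein_zade_shares(items):
    """Return {item: probability}, assuming weight ln((n + 1) / rank)."""
    n = len(items)
    weights = [math.log((n + 1) / rank) for rank in range(1, n + 1)]
    total = sum(weights)
    return {item: w / total for item, w in zip(items, weights)}

# The V category from the example: a, i, e, u, o
shares = gusein_zade_shares(["a", "i", "e", "u", "o"])
for item, share in shares.items():
    print(f"{item}: {share:.0%}")
```

For the five vowels this prints 43%, 26%, 17%, 10% and 4%, which agrees with the figures given for `V`.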
4.1 Categories inside categories and set-categories
You can use categories inside categories, for example:
default-category-distribution: flat
L = aa, ii, ee, oo
V = a, i, e, o, L
In the example above, `V` has a 20% chance of being a long vowel.
You can also enclose a set of graphemes in square brackets `[` and `]`. This is called a 'set-category'. In terms of frequency, this set is treated as if it were a reference to a category. For example, we could write the same example as this:
default-category-distribution: flat
V = a, i, e, o, [aa, ii, ee, oo]
Assigning weights to categories in categories and set-categories is possible.
Categories inside categories and set-categories CANNOT be a part of any sequence. For example, `C = Xz` or `C = x[c, d]` or `C = [a, b][c, d]` will not give the results you might want. To get sequence-like behaviour like that, you will need to use segments.
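As a rough illustration of how nesting affects frequencies, the sketch below flattens the `V = a, i, e, o, L` example from this section under a flat distribution. The function `expand_flat` and the dictionary layout are assumptions made for this sketch, not Vocabug's internals.

```python
def expand_flat(category, definitions):
    """Flatten nested categories into {grapheme: probability}, flat distribution."""
    shares = {}
    items = definitions[category]
    for item in items:
        share = 1 / len(items)  # flat: every item in the set has equal weight
        if item in definitions:
            # A nested category counts as ONE item; its share is split inside it.
            for grapheme, p in expand_flat(item, definitions).items():
                shares[grapheme] = shares.get(grapheme, 0) + share * p
        else:
            shares[item] = shares.get(item, 0) + share
    return shares

definitions = {"L": ["aa", "ii", "ee", "oo"], "V": ["a", "i", "e", "o", "L"]}
shares = expand_flat("V", definitions)
print(shares)
```

Each short vowel ends up with a share of 0.2, while the four long vowels split the remaining 0.2 between them (0.05 each), which is the 20% chance of a long vowel mentioned for this example.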
5 Building words
5.1 Words
The `words:` directive defines a set of 'word-shapes' that Vocabug will choose from to create words. A word-shape can consist of individual graphemes, categories, segments or a mixture of these.
By default, word-shapes are selected using the Zipf distribution. The first word-shape will be chosen the most often, then the second word-shape the second most often, and so on. Below is a very simple example that will generate words with one to three CV syllables:
C = t, n, k, m, l, s, r, d, h, w, b, j, p, g
V = a, i, o, e, u
words: CV, CVCV, CVCVCV
5.2 Segments
Segments provide an abbreviation for parts of a word-shape. Typically you would use one to define the shape of a syllable. Segments are defined similarly to categories, but with several important differences:
- Every segment's key starts with `$`. `S = s` is a category; `$S = s` is a segment
- Segments are not sets like categories are. `$M = a, b, c` will not work as you might expect (because, as already stated, segments are abbreviations for word-shapes). You would need to use a pick-one-set, i.e. `$M = [a, b, c]`
For example, you could write the last example like so:
$S = CV
words: $S $S$S $S$S$S
5.3 Pick-one-set
A pick-one-set is a group of graphemes and categories separated by spaces or commas, enclosed in square brackets `[` and `]`. Vocabug will pick an option from that pick-one just like it would from a category. For example:
V = a, u
words: t[V, x]
This will produce either `ta`, `tu` or `tx`.
Pick-one-sets can be nested inside each other.
Anything inside the pick-one can be assigned a weight, and a pick-one itself can be assigned a weight as well if it is nested inside another set:
words: [a*1, b*2, [c, d]*2]
5.4 Optional-set
Using round brackets `(` and `)`, an optional-set works the same way as a pick-one-set; the only difference is that what's inside can either appear in the word or not. The probability of it appearing is 10% by default.
words: ta(n, t, l)
In the above example, there is a 10% chance of getting one of `tan`, `tat` or `tal`, but a 90% chance of `ta`.
5.4.1 Optionals weight
By default, an optional-set has a 10% chance of being included in the word. You can change this probability.
5.5 Supra-set
A supra-set is applied over the entire word, and there can only be one supra-set. Curly brackets `{` and `}` denote each item in the supra-set and its location in the word. An item of a supra-set can only be a category or the null grapheme. Only one item in the supra-set will be picked for each generated word.
Supra-set is a feature designed to help generate words with stress systems, pitch accent systems, or other word-based suprasegmentals. Here is an example where it is used for a stress system:
C = t
V = a
X = '
words: ({X}CV){X}CV
This produces any of the following words: `'ta`, `ta'ta`, `'tata`, but never any word with more than one `'`. Notice here that `ta` is not possible: a supra-set item is only chosen after dealing with any sets that the supra-set items are nested in.
See the "Romance-like" example for a language that uses supra-set for its stress system.
5.5.1 Supra-set weight
You can set the weights of supra-set items like so:
M = m
N = n
$X = ka{M*8}
$Y = te{N*2}
words: $X$Y
The above example has an 80% chance of generating `kamte` and a 20% chance of generating `katen`.
6 Default distributions
The ordering of items matters in categories, segments and word-shapes. The first item will be chosen the most often, the second item the second most often, and so on.
You can change these default distributions (another name for this might have been "drop-off", but I digress). For categories, the default is `gusein-zade`, and for the separate setting for word-shapes, the default is `zipfian`. The distribution will be applied to each item in a set, and then recursively to any set that set is nested in (treating the nested set as an item), then applied at the surface level.
- A `zipfian` distribution approximates natural language frequency for words, where the highest-ranked item receives the greatest weight, and subsequent ones decay steeply until flattening out
- A `gusein-zade` distribution offers a gentler slope that approximates a natural distribution of phonemes (for consonants, vowels, etc.) in a language. It follows a logarithmic decay that still prioritises top-ranked items but spreads weight more evenly
- `Shallow` distribution, the red-headed step-child of the distributions. It doesn't occur in natural linguistics, but offers us something between Flat and Gusein-Zade. It is Zipfian in nature, a 'long-tailed Zipf'
- A `flat` distribution treats all items equally. This is not to say the items will be evenly chosen: items are still being randomly chosen on each generation, they just have the same weight
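For comparison, here is a rough sketch of how a Zipf-style ranking differs from a flat one. It assumes the textbook Zipf weight of 1 / rank; the exact formula Vocabug uses is not stated in this document, so treat the numbers as illustrative only. The function names are hypothetical.

```python
def zipf_shares(n):
    """Textbook Zipf (an assumption): the item at rank r gets weight 1 / r."""
    weights = [1 / rank for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def flat_shares(n):
    """Flat: every item has the same weight."""
    return [1 / n] * n

# Three ranked items, e.g. the word-shapes CV, CVCV, CVCVCV
print([round(p, 2) for p in zipf_shares(3)])  # [0.55, 0.27, 0.18]
print([round(p, 2) for p in flat_shares(3)])  # [0.33, 0.33, 0.33]
```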

7 Assigning weights
If you want to set your own frequency for graphemes in a category or set-category, items in a pick-one-set or optional-set, or word-shapes in the `words:` directive, you can use an asterisk `*` to specify the weight for each item, like so:
V = a*5, e*4, i*3, o*2, u*1
$S = [V*8, x*2]
words: $S*2 y
`V` has approximately the following probabilities: a: 33%, e: 27%, i: 20%, o: 13%, u: 7%. The pick-one-set in the `$S` segment has an 80% chance of producing a `V` category over the `x` grapheme. And the first word-shape in the `words:` directive has twice the chance of being chosen over the next word-shape.
As you might have noticed in the example above, when a set has at least one weighted option, the weights override any default distribution. Also important to note is that any other option that you had not given a weight (inside that set, or on the surface level) is given a weight of 1.
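The arithmetic behind these percentages is simply each weight divided by the sum of the weights in its set. The sketch below shows this for the three lines of the example; `weighted_shares` is a hypothetical helper that only handles flat lists of `item*weight` entries (no nesting or escapes), not a reimplementation of Vocabug's parser.

```python
def weighted_shares(spec):
    """Turn 'a*5, e*4' style items into probabilities; bare items get weight 1."""
    weights = {}
    for item in spec.replace(",", " ").split():
        name, _, weight = item.partition("*")
        weights[name] = float(weight) if weight else 1.0
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

print(weighted_shares("a*5, e*4, i*3, o*2, u*1"))  # a is 5/15, e is 4/15, ...
print(weighted_shares("V*8, x*2"))                 # V is 8/10, x is 2/10
print(weighted_shares("$S*2 y"))                   # $S is 2/3, y (weight 1) is 1/3
```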
8 Alphabetisation
The `alphabet:` directive gives Vocabug a custom alphabetisation order for words, used when the Sort words checkbox is selected.
alphabet: a, b, c, e, f, h, i, k, l, m, n, o, p, p', r, s, t, t', y
This would order generated words like so: cat chat cumin frog tray t'a yanny
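One way to picture this ordering is as a sort key built from each grapheme's position in the alphabet. The sketch below is an illustration under two assumptions that this document does not confirm: multi-character entries such as `t'` are matched longest-first, and characters missing from the alphabet (like `u` or `g` here) sort after everything that is listed. The name `sort_key` is made up.

```python
ALPHABET = ["a", "b", "c", "e", "f", "h", "i", "k", "l", "m",
            "n", "o", "p", "p'", "r", "s", "t", "t'", "y"]

def sort_key(word, alphabet=ALPHABET):
    """Build a list of ranks for a word, one entry per grapheme."""
    rank = {g: i for i, g in enumerate(alphabet)}
    longest_first = sorted(alphabet, key=len, reverse=True)
    key, i = [], 0
    while i < len(word):
        for g in longest_first:
            if word.startswith(g, i):
                key.append((rank[g], ""))
                i += len(g)
                break
        else:
            # Assumption: unlisted characters sort last, by code point.
            key.append((len(alphabet), word[i]))
            i += 1
    return key

words = ["yanny", "t'a", "tray", "frog", "cumin", "chat", "cat"]
print(sorted(words, key=sort_key))
# ['cat', 'chat', 'cumin', 'frog', 'tray', "t'a", 'yanny']
```

Note that `tray` sorts before `t'a` because `t` precedes `t'` in the alphabet, even though the apostrophe would come first in plain character order.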
9 Transform
Once words are generated, you might want to modify them to prevent certain sequences, outright reject certain words, or simulate historical sound changes.
When this document uses examples to explain transforms, the last comment (comments follow a semicolon) shows an example word transforming. For example, `; amda ==> ampa` means the rule will transform the word `amda` into `ampa`.
9.1 Defining graphemes
The `graphemes:` directive tells Vocabug which (multi)graphs, including character + combining diacritics, are to be treated as grapheme units when using transformations.
graphemes: a, b, c, ch, e, f, h, i, k, l, m, n, o, p, p', r, s, t, t', y
In the above example, we defined `ch` as a grapheme. This would stop a rule such as `c -> g` changing the word `chat` into `ghat`, but it will still make `cobra` change into `gobra`.
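The effect can be sketched as splitting each word into graphemes before any rule is applied. The code below assumes a greedy longest-match split, which is a guess at the behaviour rather than something this document specifies; `tokenize` and `change` are hypothetical names, and `change` only handles a single-grapheme target.

```python
GRAPHEMES = ["a", "b", "c", "ch", "e", "f", "h", "i", "k", "l", "m",
             "n", "o", "p", "p'", "r", "s", "t", "t'", "y"]

def tokenize(word, graphemes=GRAPHEMES):
    """Split a word into graphemes, preferring the longest match at each position."""
    longest_first = sorted(graphemes, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(word):
        # Fall back to the single character if nothing in the list matches.
        match = next((g for g in longest_first if word.startswith(g, i)), word[i])
        tokens.append(match)
        i += len(match)
    return tokens

def change(word, target, result):
    """Apply an unconditional `target -> result` to whole graphemes only."""
    return "".join(result if t == target else t for t in tokenize(word))

print(tokenize("chat"))           # ['ch', 'a', 't']
print(change("chat", "c", "g"))   # chat  (ch is one unit, so c never matches)
print(change("cobra", "c", "g"))  # gobra
```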
10 The change
In Vocabug-lite there is just one field to a transform, the `CHANGE`. In Vocabug-lite, changes are always unconditional.
The format of the change can be expressed as `TARGET -> RESULT`.
- `TARGET` specifies which part of the word is being changed
- Then followed by `→`
- `RESULT` is what `TARGET` is changing into, or in other words, what replaces it
Let's look at a simple unconditional transformation:
; Replace every /o/ with /x/
o -> x
; bodido ==> bxdidx
In this rule, we see every instance of `o` become `x`.
10.1 Concurrent change
A concurrent change is achieved by listing multiple items in `TARGET`, and listing the same number of resultant items in `RESULT`, separated by commas or spaces. Changes in a concurrent change execute at the same time:
; Switch /o/ and /a/ around
o, a -> a, o
; boda ==> bado
Notice that the above example is different to the example below:
o -> a
a -> o
; boda ==> bodo
where each change is on its own line. Here `o` first merges with `a`, then every `a` becomes `o`.
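The difference between the two can be sketched in a few lines. This is an illustration only: it assumes single-character graphemes, and the function names `concurrent` and `sequential` are made up for the example.

```python
def concurrent(word, targets, results):
    """All replacements read from the ORIGINAL word, so they cannot feed each other."""
    mapping = dict(zip(targets, results))
    return "".join(mapping.get(ch, ch) for ch in word)

def sequential(word, rules):
    """Each rule runs on the output of the previous rule."""
    for target, result in rules:
        word = word.replace(target, result)
    return word

print(concurrent("boda", ["o", "a"], ["a", "o"]))    # bado
print(sequential("boda", [("o", "a"), ("a", "o")]))  # bodo
```

In the sequential case the word passes through `bada` on the way, which is why the original `o` and `a` can no longer be told apart by the second rule.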
10.2 Word boundary
`#` matches word boundaries in `TARGET`. It matches the beginning of the word if it is placed at the beginning, or the end of the word if it is placed at the end:
o# -> x
; opo ==> opx
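A minimal sketch of the idea, using plain string checks. The word-initial form `#o -> x` is inferred from the description rather than shown in this document, and both function names are hypothetical.

```python
def change_at_end(word, target, result):
    """Sketch of `target# -> result`: only a word-final target is replaced."""
    if word.endswith(target):
        return word[:len(word) - len(target)] + result
    return word

def change_at_start(word, target, result):
    """Sketch of `#target -> result`: only a word-initial target is replaced."""
    if word.startswith(target):
        return result + word[len(target):]
    return word

print(change_at_end("opo", "o", "x"))    # opx
print(change_at_start("opo", "o", "x"))  # xpo
```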
10.3 Reject
To remove, or in other words, reject a word, use the `^REJECT` keyword in `RESULT`, by itself:
a, bi -> x, ^REJECT
In the above example, any word that contains `bi` will be rejected.
A shorthand for `^REJECT` is `^R`.
10.4 Deletion
Deletion happens when `^` or `∅` is present in `RESULT`:
; delete every /b/
b -> ^
; bubda ==> uda
10.5 Cluster-field
A cluster-field is a way to target sequences of graphemes and change them. Cluster-fields are laid out as tables, and start with `%` followed by a space. The first part of a sequence is in the first column, and the second part is in the first row. For example:
% p t k m n
m + nt nk + mm
n mp + + nn +
- In this example, `np` becomes `mp` and `mt` becomes `nt`
- `+` means to not change the target cluster at all
- `-` means to reject the word if it contained that sequence
- Cluster-fields can use `^` or `∅` to delete the target sequence
- These are executed concurrently just like concurrent changes. Their order does not matter
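Conceptually, a cluster-field is a lookup from a pair of graphemes to an outcome. The sketch below encodes the cells of the example that actually change something as a Python dictionary and scans the word once; the `+` cells are simply left out. The dictionary layout and the name `apply_clusters` are assumptions for illustration, and rejection and deletion are not modelled.

```python
# (first part, second part) -> replacement, taken from the example table.
CLUSTERS = {
    ("m", "t"): "nt", ("m", "k"): "nk", ("m", "n"): "mm",
    ("n", "p"): "mp", ("n", "m"): "nn",
}

def apply_clusters(graphemes, table=CLUSTERS):
    """Replace matching pairs in one pass over the original graphemes."""
    out, i = [], 0
    while i < len(graphemes):
        pair = tuple(graphemes[i:i + 2])
        if pair in table:
            out.append(table[pair])
            i += 2  # both graphemes of the cluster are consumed
        else:
            out.append(graphemes[i])
            i += 1
    return "".join(out)

print(apply_clusters(list("anpa")))  # ampa
print(apply_clusters(list("amta")))  # anta
print(apply_clusters(list("ampa")))  # ampa (mp is a + cell, so unchanged)
```

Because every lookup reads the original graphemes rather than earlier output, the order of the table cells does not affect the result.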