
Vocabug
documentation
Version 0.0.0
Contents
- About Vocabug
- Interface
- Using comments
- About graphemes
- Categories
- Assigning weights
- Building words
- Alphabetisation and graphs
- Transform
1 About Vocabug
This is the complete documentation for Vocabug, version 0.0.0.
Vocabug randomly generates vocabulary from a given definition of graphemes, frequencies and word patterns. You can use it to make words for a constructed language, to get an original nickname or password, or just for fun.
2 Interface
- The textbox at the top of the program is the definition-build editor. A definition-build defines the graphemes, frequencies, word-shapes and transformations that generate the final words. The editor will already contain either a default definition-build or the previous definition-build that generated words
- Use the Generate button to see Vocabug produce words
- Use the Copy button to copy the words to the clipboard
2.1 Options
- Use the Number of words textbox to choose the number of words to generate. The default number is 100
- Word-list mode will produce a list of words
- Paragraph mode will produce words that look vaguely like sentences by injecting punctuation into the word list and capitalising the first word of each sentence
- Debug mode will show, line by line, each step in creating each word
- Editor wrap lines will make the definition-build editor wrap a line onto the next line if it exceeds the width of the editor
- Show keyboard will reveal a 'keyboard' below the editor
- Remove duplicates will make sure all words generated are unique
- Force words will force the generator to try to generate the complete number of words requested within 30 seconds, regardless of how many words are rejected or removed as duplicates
- Sort words and Capitalise words should be self-explanatory
- The Word divider textbox sets the delimiter, or in other words, what the content will be between each word in the output. It is a space by default. Use \n to get one word on each line
2.2 File save / load
- Use the Save button to download the definition-build as a file called 'vocabug.txt', or whatever you named your file in the File name: field. The file is always a ".txt" type
- Use the Load button to load a file from your system into the definition-build editor
- Use the buttons in the Examples dropdown to load a number of example definition-builds into the definition-build editor
4 About graphemes
Graphemes are the indivisible, meaningful characters that make up a generated word in Vocabug. Phonemes can be thought of as graphemes. Taking the English words sky and shy as examples, sky is made up of the graphemes s + k + y, while shy is made up of sh + y.
4.1 Null grapheme
If a word is built using the syntax character ~, that character will disappear in the generated word. In other words, ~ is a null grapheme. If you want to use ~ as a grapheme, you will need to escape it. To use other syntax characters as graphemes, they must be escaped too.
4.2 Escaping characters
A character placed after the syntax character \ loses any special meaning it would otherwise have in the generator. This applies to backslashes themselves too. This way anything, including capital letters that have already been defined as categories, brackets and even spaces, can be generated or targeted in transformations.
4.2.1 Word creation character escape
These are the characters you must escape if you want to use them in categories, segments and the words directive:
Characters | Meaning |
---|---|
; | Comment |
{& , } | HTML entities |
C , D , Ḱ , ... | Any one-length character can refer to a category |
{ , } | References a long form category name |
, | Separates choices |
(space) | Space, separates choices. An alternative to commas |
$ | Defines a segment |
: | Gives weight to a grapheme, segment, set, or word-shape |
? | Gives probability of an optional-set being chosen |
@ | Gives weight of an inter-set being chosen over others |
[ , ] | Pick-one-set |
( , ) | Optional-set |
< , > | Inter-set |
\ | Escapes a character after it |
4.2.2 Transform character escape
These are the characters you must escape if you want to use them in the transform block.
Characters | Meaning |
---|---|
; | Comment |
> , -> , => , ⇒ , → | Indicates change |
, | Separates choices |
[ , ] | Concurrent-set or merging-set |
( , ) | Optional-set |
~REJECT | Rejects a word |
/ | The condition follows this character |
_ | The underscore _ is a reference to the target |
# | Word boundary |
$ | Syllable boundary |
! | The exception follows this character |
{ , } | Category or feature-matrix |
* | Wildcard, matches exactly 1 of any character |
" | Ditto-mark, matches exactly 1 of the previous character |
+ | Plus-mark, matches 1 or more of the previous character |
^ , … | Anythings-mark, matches 1 or more of any character. It is non-greedy |
=[ , ] | Quantifier |
<[ , ] | Blocker |
@[ , ] | Positioner |
~ | Insertion when in TARGET, deletion when in RESULT |
| | Indicates metathesis, and the reordered contents |
1 , 2 , ... 9 | In a metathesis rule, in RESULT, these represent the changing graphemes |
\ | Escapes a character after it |
4.3 HTML entities
When enclosed in curly brackets { and }, the name or hex-code of an HTML entity is decoded at the very end of the definition-build. For example, {&Agrave} will give À, and {&#x2603} will give ☃.
If using the name of an HTML entity, you must preface it with &. When using the hex-code of an HTML entity, you must preface it with &#. You cannot end the name or hex-code of the entity with a semicolon.
5 Categories
A category is a set of graphemes with a name. The name is usually a single character, but can be long-form. For example:
C = t, n, k, m, ch, l, ꞌ, s, r, d, h, w, b, y, p, g
F = n, l, ꞌ, t, k, r, p
V = a, i, e, u, o
This creates three groups of graphemes. C is the group of all consonants, V is the group of all vowels, and F is the group of some of the consonants that will be used syllable-finally.
These graphemes are separated by commas, but an alternative is to use spaces: C = t n k m ch l ꞌ s r d h w b y p g. You may not use both commas and spaces as separators on the same line, so a line such as A = a b, c is not allowed.
By default, the graphemes' frequencies decrease as they go to the right, according to the Gusein-Zade distribution. In the above example, when Vocabug needs to choose a V, it will choose a the most at 43%, i the second-most at 26%, e the third-most at 17%, u the fourth-most at 10%, and o the fifth-most at 4%.
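The percentages quoted above can be reproduced with a short Python sketch. It assumes the usual form of the Gusein-Zade distribution, where the item at rank r out of n gets a share proportional to ln((n + 1) / r). The function name gusein_zade is only illustrative, and this is not taken from Vocabug's own source code.

```python
import math


def gusein_zade(n: int) -> list[float]:
    """Return the share of each rank 1..n, proportional to ln((n + 1) / rank)."""
    raw = [math.log((n + 1) / rank) for rank in range(1, n + 1)]
    total = sum(raw)
    return [value / total for value in raw]


# The V category from the example, in the order it was written.
vowels = ["a", "i", "e", "u", "o"]
percentages = {
    vowel: round(share * 100)
    for vowel, share in zip(vowels, gusein_zade(len(vowels)))
}
print(percentages)
```

With five graphemes this yields the same rounded figures as the text: 43, 26, 17, 10 and 4.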
5.3 Categories inside categories and set-categories
You can use categories inside categories, as long as the referenced category has previously been defined. For example:
class-drop-off: flat
L = aa, ii, ee, oo
V = a, i, e, o, L
In the example above, V has a 20% chance of being a long vowel.
You can also enclose a set of graphemes in square brackets [ and ]. This is called a 'set-category'. This set will be treated as if it were a reference to a category in terms of frequency. For example, we could write the same example as this:
class-drop-off: flat
V = a, i, e, o, [aa, ii, ee, oo]
Assigning weights to categories in categories and set-categories is possible.
Categories inside categories and set-categories CANNOT be a part of any sequence. For example, C = Xz or C = x[c, d] or C = [a, b][c, d] will not give the results you might want. To get sequence-like behaviour like that, you will need to use segments.
6 Assigning weights
If you want to set your own frequency for graphemes in a category or set-category, items in a pick-one-set, optional-set, or inter-set, or word-shapes in the words: directive, you can use a colon : to specify the weight for each item, like so:
V = a:5, e:4, i:3, o:2, u:1
$S = [V:8 x:2]
words: $S:2 y
V has approximately the following probabilities: a: 33%, e: 27%, i: 20%, o: 13%, u: 7%. The pick-one-set in the $S segment has an 80% chance of producing a V category over the x grapheme. And the first word-shape in the words: directive has twice the chance of being chosen over the next word-shape.
As you might have seen in the example above, when any option in a list of options has a weight, the weights override the drop-off frequencies. Note also that any option you have not given a weight gets a default weight of 1.
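The arithmetic behind these figures is plain normalisation: each weight is divided by the sum of all weights in the same list. The Python sketch below illustrates this with a hypothetical helper, weights_to_percent. It only checks the numbers quoted above and is not Vocabug's implementation.

```python
def weights_to_percent(weights: dict[str, float]) -> dict[str, int]:
    """Convert raw weights into whole-number percentages of their total."""
    total = sum(weights.values())
    return {item: round(100 * weight / total) for item, weight in weights.items()}


# V = a:5, e:4, i:3, o:2, u:1
v_category = weights_to_percent({"a": 5, "e": 4, "i": 3, "o": 2, "u": 1})

# $S = [V:8 x:2]
pick_one = weights_to_percent({"V": 8, "x": 2})

# words: $S:2 y   (y has no weight, so it gets the default weight of 1)
word_shapes = weights_to_percent({"$S": 2, "y": 1})

print(v_category, pick_one, word_shapes)
```

The last call shows the default weight at work: $S is chosen about two times out of three, which is twice as often as y.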
7 Building words
7.1 Words
The words: directive defines a set of 'word-shapes' that Vocabug will choose from to create words. A word-shape can consist of individual graphemes, categories, segments or a mixture of these.
By default, word-shapes are selected using the Zipf distribution. The first word-shape will be chosen the most often, then the second word-shape the second most often and so on. Below is a very simple example that will generate words with one to three CV syllables:
C = t, n, k, m, l, s, r, d, h, w, b, j, p, g
V = a, i, o, e, u
words: CV, CVCV, CVCVCV
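To get a feel for how steep a Zipf drop-off is, the sketch below assumes the classic form, where the word-shape at rank r gets a share proportional to 1 / r. The exact parameters Vocabug uses are not stated in this document, so treat the resulting fractions as an illustration only.

```python
from fractions import Fraction


def zipf(n: int) -> list[Fraction]:
    """Return the share of each rank 1..n, proportional to 1 / rank (assumed exponent of 1)."""
    raw = [Fraction(1, rank) for rank in range(1, n + 1)]
    total = sum(raw)
    return [weight / total for weight in raw]


shapes = ["CV", "CVCV", "CVCVCV"]
chances = dict(zip(shapes, zipf(len(shapes))))
for shape, chance in chances.items():
    print(shape, chance)
```

Under that assumption the three word-shapes are chosen with chances of 6/11, 3/11 and 2/11.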
7.1.1 Word-drop-off
This directive modifies how steeply the word-shapes' frequencies decrease as they go to the right, unless they have weights. The options are zipfian, gusein-zade, and flat. The default is zipfian.
It is better not to use this directive or give word-shapes weights -- it is an uphill battle. For example, if you chose to remove duplicates in the above example, it is already removing one-syllable words the most often. And if you have paragraph mode turned on, you would want simple syllables to occur very often. So it is best to simply rearrange the word-shapes in the words directive to get good-looking results. Nevertheless, maybe you want to use a flat distribution because you are only generating CVCV syllables of different types, or generating something that doesn't play by the rules.
7.2 Segments
Segments are a system that provides an abbreviation of parts of a word-shape. Typically you would use it to define the shape of a syllable. Segments are defined similarly to categories, but with several important differences:
- Every segment's name starts with $. S = s is a category; $S = s is a segment.
- Segments are not sets like categories are. $M = a, b, c will not work. You would need to use a pick-one-set, i.e: $M = [a, b, c]
- Segments have an effect on the logic behind inter-sets. In this sense, segments are not just abbreviations.
For example, you could write the last example like so:
$S = CV
words: $S $S$S $S$S$S
7.3 Pick-one-set
A pick-one-set is a group of graphemes and categories separated by spaces or commas, enclosed in square brackets [ and ]. Vocabug will pick one option from a pick-one-set just like it would from a category. For example:
V = a, u
words: t[V, x]
This will produce either ta, tu or tx.
Pick-one-sets can be nested inside each other.
Anything inside a pick-one-set can be assigned a weight, and a pick-one-set itself can be assigned a weight as well if it is nested inside another set:
words: [a:1, b:2, [c, d]:2]
7.4 Optional-set
An optional-set, using round brackets ( and ), works the same way as a pick-one-set. The only difference is that its contents can either appear in the word or not. By default, the contents appear 10% of the time.
words: ta(n, t, l)
In the above example, there is a 10% chance of getting one of tan, tat or tal, but a 90% chance of ta.
7.5 Inter-set
An inter-set, using less-than and greater-than signs < and >, works the same as a pick-one-set. The difference is that only one inter-set will be chosen for that segment or word-shape.
Inter-set is a feature designed to help generate words with stress or pitch accent systems. Here is an example where it is used for a stress system:
C = t
V = a
$X = (<'>CV)<'>CV
words: $X
This produces any of the following words: 'ta, ta'ta, 'tata, but never any words with a double '. Notice here that ta is not possible -- an inter-set is only chosen after dealing with any sets that the inter-sets are nested in.
There are a few restrictions and peculiarities to it. Most notably, inter-sets may not be nested inside each other. Let's look at another example:
class-drop-off: flat
words: <a, b><x>
The above example is rather silly, as there is nothing between each inter-set, defeating its whole purpose. However, it is useful as an example here in showing that it is equivalent to the example below, which uses pick-one-sets instead.
class-drop-off: flat
words: [[a, b],[x]]
In both of the above examples, there is a 25% chance of producing a, a 25% chance of b, and a 50% chance of producing x.
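The 25% / 25% / 50% split follows from choosing evenly at each level of nesting: first between the two inner sets, then among the options inside the chosen set. The Python sketch below works this out for the nested pick-one form, using nested lists as a stand-in for the bracket syntax. The function outcome_chances is hypothetical and only models the flat drop-off case.

```python
from fractions import Fraction


def outcome_chances(options: list, share: Fraction = Fraction(1)) -> dict:
    """Split `share` evenly between options, recursing into nested lists."""
    chances: dict = {}
    each = share / len(options)
    for option in options:
        if isinstance(option, list):
            # A nested set receives one equal share, then divides it among its own options.
            for grapheme, chance in outcome_chances(option, each).items():
                chances[grapheme] = chances.get(grapheme, 0) + chance
        else:
            chances[option] = chances.get(option, 0) + each
    return chances


# words: [[a, b],[x]] with class-drop-off: flat
result = outcome_chances([["a", "b"], ["x"]])
print(result)
```

Because x is alone in its set, it keeps the whole half, while a and b share the other half.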
See the "Romance-like" example for a language that uses inter-set for its stress system, or the "BTX" example for a language that uses it for a complex pitch accent system.
8 Alphabetisation and graphemes
The alphabet:, graphs:, alphabet-graphs: and invisible-alphabet: directives can be important elements of your definition-build. Let's go over their uses.
8.1 Alphabetisation
The alphabet directive gives Vocabug a custom alphabetisation order for words, when the sort words checkbox is selected.
alphabet: a, b, c, e, f, h, i, k, l, m, n, o, p, p', r, s, t, t', y
This would order generated words like so: cat chat cumin frog tray t'a
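One way to picture a custom alphabetisation order is as a sort key: each word is split into the longest letters found in the alphabet, and each letter is replaced by its position in the list. The Python sketch below illustrates the idea and is not Vocabug's implementation. In particular, placing unlisted characters after every listed letter is an assumption, and the word list is made up for the example.

```python
ALPHABET = ["a", "b", "c", "e", "f", "h", "i", "k", "l", "m",
            "n", "o", "p", "p'", "r", "s", "t", "t'", "y"]
RANK = {letter: index for index, letter in enumerate(ALPHABET)}
LONGEST = max(len(letter) for letter in ALPHABET)


def sort_key(word: str) -> list[int]:
    """Turn a word into a list of alphabet positions, preferring the longest letter."""
    key, pos = [], 0
    while pos < len(word):
        for size in range(LONGEST, 0, -1):
            chunk = word[pos:pos + size]
            if chunk in RANK:
                key.append(RANK[chunk])
                pos += len(chunk)
                break
        else:
            # Assumption: a character missing from the alphabet sorts last.
            key.append(len(ALPHABET))
            pos += 1
    return key


ordered = sorted(["t'a", "tray", "cat", "pat", "p'a"], key=sort_key)
print(ordered)
```

Since t' is its own letter placed after t, tray sorts before t'a, which matches the ordering shown above.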
8.2 Defining graphs
The graphs: directive tells Vocabug which (multi)graphs, including character + combining diacritics, are to be treated as grapheme units when using transformations.
graphs: a, b, c, ch, e, f, h, i, k, l, m, n, o, p, p', r, s, t, t', y
In the above example, we defined ch as a grapheme. This would stop a rule such as c -> g changing the word chat into ghat, but it will still make cobra change into gobra.
"But my list of graphemes is the same as my alphabetisation order, I don't want to list them twice", you might exclaim. Well, you can create an alphabetisation order and list your graphemes in one line using the alphabet-and-graphs: directive.
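The effect of the graphs: directive on a rule such as c -> g can be sketched by splitting each word into grapheme units before applying the change. The Python below uses a longest-match split, which is an assumption about how multigraphs are recognised. The functions split_graphemes and replace_grapheme are hypothetical names, not part of Vocabug.

```python
GRAPHS = ["a", "b", "c", "ch", "e", "f", "h", "i", "k", "l",
          "m", "n", "o", "p", "p'", "r", "s", "t", "t'", "y"]


def split_graphemes(word: str, graphs: list[str]) -> list[str]:
    """Split a word into units, preferring the longest listed graph at each position."""
    known = set(graphs)
    longest = max(len(graph) for graph in graphs)
    units, pos = [], 0
    while pos < len(word):
        for size in range(min(longest, len(word) - pos), 0, -1):
            chunk = word[pos:pos + size]
            # A single character is always accepted, even if it is not listed.
            if size == 1 or chunk in known:
                units.append(chunk)
                pos += size
                break
    return units


def replace_grapheme(word: str, target: str, result: str, graphs: list[str]) -> str:
    """Apply an unconditional TARGET -> RESULT change to whole grapheme units only."""
    return "".join(result if unit == target else unit
                   for unit in split_graphemes(word, graphs))


print(split_graphemes("chat", GRAPHS))             # ['ch', 'a', 't']
print(replace_grapheme("chat", "c", "g", GRAPHS))  # chat
print(replace_grapheme("cobra", "c", "g", GRAPHS)) # gobra
```

Because ch is consumed as one unit, the c inside it is never compared with the target.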
8.2.1 Alternative graphemes
The graphs: directive can tell Vocabug which character + combining diacritic sequences are to be treated as alternatives of a base grapheme. Let's name these alternatives the 'children' and the base grapheme the 'parent'. You can do this by enclosing the 'children' in <[ and ] as a set, directly after their 'parent'.
Important: The left-most precomposed character of a 'child' must be the same as its 'parent'.
This should be useful for tonal languages that mark tone with diacritics on vowels. In these tonal languages, we no longer need to list every variation of a vowel + diacritic to capture a vowel:
graphs: a, <[á, à, ā, ǎ], h, i, <[í, ì, ī, ǐ], k, l, m, n, o, <[ó, ò, ō, ǒ], t
a -> o
; mápǎ ==> mópǒ
However, we can still capture a vowel with a tone mark, such as ǎ:
ǎ -> o
; mápǎ ==> mápo
8.3 Invisibility
Sometimes you will want characters, such as syllable dividers, to be invisible to alphabetisation. You can do this by listing these characters in the invisible-alphabet: directive.
invisible-alphabet: ., ˈ
This would order the generated words ˈpa.ta ˈca.ta za.ˈta ca.ˈa as ca.ˈa, ˈca.ta, ˈpa.ta, za.ˈta
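In sorting terms, invisibility amounts to removing the listed characters before words are compared, while leaving the words themselves untouched. The Python sketch below reproduces the ordering above using the default character order for the remaining letters. The helper visible_key is a hypothetical name used only for this illustration.

```python
INVISIBLE = {".", "ˈ"}


def visible_key(word: str) -> str:
    """Return the word with invisible characters removed, for comparison only."""
    return "".join(char for char in word if char not in INVISIBLE)


words = ["ˈpa.ta", "ˈca.ta", "za.ˈta", "ca.ˈa"]
ordered = sorted(words, key=visible_key)
print(ordered)
```

The words are compared as pata, cata, zata and caa, but the output keeps the syllable dividers and stress marks.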
9 Transform
Once words are generated, you might want to modify them to prevent certain sequences, outright reject certain words, or simulate historical sound changes. This is the purpose of the transform block, which implements the NASC program.
All transformations must be used inside this block. To terminate a block, you use an END line. However, all unterminated blocks are automatically terminated at the end of the definition-build:
BEGIN transform:
; Your rules go here
END
A NASC rule can be summarised in four fields: CHANGE / CONDITION ! EXCEPTION. The characters / and ! that precede each field (except for the CHANGE) are necessary for signalling each field. For example, including a ! will signal that this rule contains an exception, and all text following it until the next field marker will be interpreted as such.
Every rule begins on a new line and must contain a CHANGE. The CONDITION and EXCEPTION fields are optional.
If you want to capture graphemes that are normally syntax characters in transformations, you will need to escape them.
When this document uses examples to explain transformations, the last comment shows an example word transforming. For example, ; amda ==> ampa means the rule will transform the word amda into ampa.
10 The change
The format of the change can be expressed as TARGET -> RESULT.
- TARGET specifies which part of the word is being changed
- It is then followed by a space and the > character. > can be swapped with either ->, =>, ⇒ or → if you prefer
- RESULT is what TARGET is changing into, or in other words, what replaces it
Let's look at a simple unconditional rule:
; Replace every /o/ with /x/
o -> x
; bodido ==> bxdidx
In this rule, we see every instance of o become x.
10.1 Concurrent set
A concurrent set in a change is achieved by listing multiple graphemes in TARGET separated by commas in square brackets, and listing the same number of resultant graphemes in RESULT separated by commas in square brackets. Changes in a concurrent set execute at the same time:
; Switch /o/ and /a/ around
[o, a] -> [a, o]
; boda ==> bado
Notice that the above example is different to the example below, where each change is on its own line:
o -> a
a -> o
; boda ==> bodo
Here, we can see o merge with a, and then every a becomes o.
In the concurrent example above, square brackets were used, but because the entire rule was a concurrent set, the square brackets are optional:
; Switch /o/ and /a/ around
o, a -> a, o
; boda ==> bado
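The difference between a concurrent set and the same changes written on separate lines can be sketched in Python. The two functions below are hypothetical helpers that only handle single-character graphemes: one looks every grapheme up in the original word at once, the other applies each change to the output of the previous one.

```python
def apply_concurrent(word: str, changes: dict[str, str]) -> str:
    """Replace every grapheme by looking it up in the original word, all at once."""
    return "".join(changes.get(char, char) for char in word)


def apply_in_order(word: str, changes: list[tuple[str, str]]) -> str:
    """Apply each change to the result of the previous change, one line at a time."""
    for target, result in changes:
        word = word.replace(target, result)
    return word


# [o, a] -> [a, o]
print(apply_concurrent("boda", {"o": "a", "a": "o"}))      # bado

# o -> a, then a -> o on the following line
print(apply_in_order("boda", [("o", "a"), ("a", "o")]))    # bodo
```

In the second call the word passes through bada on the way, which is why the original o and a end up identical.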
10.4 Reject
To remove, or in other words, reject a word, you use the ~REJECT keyword in RESULT:
a, bi -> ~REJECT
In the above example, any word that contains a or bi will be rejected.
11.3 Word boundary
# matches word boundaries: either the beginning of the word if it is in TARGET, or the end of the word if it is in RESULT.
o -> x / p_p#
; opoppop ==> opoppxp
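For readers who know regular expressions, the example rule can be approximated in Python with lookarounds: the part of the condition before _ becomes a lookbehind, the part after it becomes a lookahead, and the trailing # becomes an end-of-word anchor. This is only an analogy for this one rule, not a description of how Vocabug evaluates conditions, and the function name is made up.

```python
import re


def o_to_x_before_final_p(word: str) -> str:
    """Approximate `o -> x / p_p#` with a regular expression."""
    # (?<=p)  a p directly before the target (the left side of _)
    # (?=p$)  a p directly after the target, followed by the end of the word (#)
    return re.sub(r"(?<=p)o(?=p$)", "x", word)


result = o_to_x_before_final_p("opoppop")
print(result)  # opoppxp
```

Only the last o changes: the o in the middle of the word also sits between two p graphemes, but the p after it is not at the end of the word.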
16 Insertion and deletion
Insertion requires a condition to be present, and for a tilde ~ to be present in TARGET, representing nothing.
; insert /a/ in between /b/ and /t/
~ -> a / b_t
; bt ==> bat
Deletion happens when ~ is present in RESULT:
; delete every /b/
b -> ~
; bubda ==> uda
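Both rules have close equivalents in ordinary string handling, sketched below in Python. Insertion corresponds to replacing an empty match at the position described by the condition, and deletion corresponds to replacing the target with nothing. The function names are hypothetical, and the regular expression is an analogy, not Vocabug's own mechanism.

```python
import re


def insert_a_between_b_and_t(word: str) -> str:
    """Approximate `~ -> a / b_t`: insert a at every position between b and t."""
    return re.sub(r"(?<=b)(?=t)", "a", word)


def delete_every_b(word: str) -> str:
    """Approximate `b -> ~`: remove every b."""
    return word.replace("b", "")


print(insert_a_between_b_and_t("bt"))  # bat
print(delete_every_b("bubda"))         # uda
```

The insertion pattern matches no characters at all, only a position, which mirrors the idea of ~ representing nothing in TARGET.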