Vocabug
documentation

Version 0.0.0

Contents

  1. About Vocabug
  2. Interface
    1. Options
    2. File save / load
  3. Using comments
  4. About graphemes
    1. Null grapheme
    2. Escaping characters
      1. Word creation character escape
      2. Transform character escape
    3. HTML Entities
  5. Categories
    1. Categories inside categories and category-sets
  6. Assigning weights
  7. Building words
    1. Words
      1. Word-drop-off directive
    2. Segments
    3. Pick-one-set
    4. Optional-set
    5. Inter-set
  8. Alphabetisation and graphs
    1. Alphabetisation
    2. Defining graphemes
      1. Alternative graphemes
    3. Invisibility
  9. Transform
    1. Concurrent-set
    2. Reject
    3. Insertion and deletion

1 About Vocabug

This is the complete documentation for Vocabug, version 0.0.0.

Vocabug randomly generates vocabulary from a given definition of graphemes, frequencies and word patterns. You can use it to make words for a constructed language, to get an original nickname or password, or just for fun.

2 Interface

2.1 Options

2.2 File save / load

4 About graphemes

Graphemes are the indivisible, meaningful units that make up a generated word in Vocabug. Phonemes can be thought of as graphemes. Taking the English words sky and shy as examples, sky is made up of the graphemes s + k + y, while shy is made up of sh + y.

4.1 Null grapheme

If a word is built using the syntax character ~, that character will not appear in the generated word. In other words, ~ is a null grapheme. If you want to use ~ as a grapheme, you will need to escape it. Other syntax characters must also be escaped to be used as graphemes.

4.2 Escaping characters

A character that follows the syntax character \ loses any meaning it might have had in the generator. This applies to backslashes themselves too. This way anything, including capital letters that have already been defined as categories, brackets, and even spaces, can be generated or targeted in transformations.

4.2.1 Word creation character escape

These are the characters you must escape if you want to use them in categories, segments and the words directive:

Characters      Meaning
;               Comment
{&, }           HTML Entities
C, D, , ...     Any one-length character can refer to a category
{, }            References a long form category name
,               Separates choices
Space           Separates choices. An alternative to commas
$               Defines a segment
:               Gives weight to a grapheme, segments, set, or word-shape
?               Gives probability of an optional-set being chosen
@               Gives weight of an inter-set being chosen over others
[, ]            Pick-one-set
(, )            Optional-set
<, >            Inter-set
\               Escapes a character after it

4.2.2 Transform character escape

These are the characters you must escape if you want to use them in the transform block.

Characters      Meaning
;               Comment
>, ->, =>, ,    Indicates change
,               Separates choices
[, ]            Concurrent-set or merging-set
(, )            Optional-set
~REJECT         Rejects a word
/               The condition follows this character
_               The underscore _ is a reference to the target
#               Word boundary
$               Syllable boundary
!               The exception follows this character
{, }            Category or feature-matrix
*               Wildcard, matches exactly 1 of any character
"               Ditto-mark, matches exactly 1 of the previous character
+               Plus-mark, matches 1 or more of the previous character
^,              Anythings-mark, matches 1 or more of any character. It is non-greedy
=[, ]           Quantifier
<[, ]           Blocker
@[, ]           Positioner
~               Insertion when in TARGET, deletion when in RESULT
|               Indicates metathesis, and the reordered contents
1, 2, ... 9     In a metathesis rule, in RESULT, these represent the changing graphemes
\               Escapes a character after it

4.3 HTML entities

When the name or hex-code of an HTML entity is enclosed in curly brackets { and }, it will be decoded at the very end of the definition build. For example {&Agrave} will give À, and {&#x2603} will give ☃.

If using the name of an HTML entity, you must preface it with &. When using the hex-code of an HTML entity, you must preface it with &#. You cannot end the name or hex-code of the entity with a semicolon.
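
To make the decoding step concrete, here is a minimal Python sketch. It is not Vocabug's own implementation: the helper name decode_entities and the regular expression are assumptions made for illustration. It leans on Python's standard html.unescape and adds back the semicolon that the Vocabug syntax leaves out.

```python
import html
import re

# Hypothetical pattern: {&name}, {&#xHEX} or {&#DEC}, with no trailing semicolon.
ENTITY = re.compile(r"\{(&#?[0-9A-Za-z]+)\}")

def decode_entities(text):
    """Replace each {&...} with the character the HTML entity stands for."""
    # html.unescape works on the standard form, so the semicolon is re-added here.
    return ENTITY.sub(lambda m: html.unescape(m.group(1) + ";"), text)

print(decode_entities("{&Agrave} la carte"))  # À la carte
print(decode_entities("{&#x2603}"))           # ☃
```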

5 Categories

A category is a set of graphemes with a name. The name is usually a single character, but can be long-form. For example:

C = t, n, k, m, ch, l, ꞌ, s, r, d, h, w, b, y, p, g
F = n, l, ꞌ, t, k, r, p
V = a, i, e, u, o

This creates three groups of graphemes. C is the group of all consonants, V is the group of all vowels, and F is the group of some of the consonants that will be used syllable-finally.

These graphemes are separated by commas; an alternative is to use spaces: C = t n k m ch l ꞌ s r d h w b y p g. You may not use both commas and spaces as separators on the same line. For example, A = a b, c is not allowed.

By default, the graphemes' frequencies decrease as they go to the right, according to the Gusein-Zade distribution. In the above example, when Vocabug needs to choose a V, it will choose a the most at 43%, i the second-most at 26%, e the third-most at 17%, u the fourth-most at 10%, and o the fifth most at 4%.
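
The percentages quoted above are consistent with giving rank r (out of n items) the weight ln((n + 1) / r) and then normalising. The Python sketch below reproduces them; the formula is inferred from those percentages rather than taken from Vocabug's source, and the function name gusein_zade is only illustrative.

```python
import math

def gusein_zade(n):
    """Probability of each rank 1..n, using the weight ln((n + 1) / rank)."""
    weights = [math.log((n + 1) / rank) for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

for grapheme, p in zip("aieuo", gusein_zade(5)):
    print(f"{grapheme}: {p:.0%}")
# a: 43%, i: 26%, e: 17%, u: 10%, o: 4% (one per line)
```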

5.3 Categories inside categories and set-categories

You can use categories inside categories, as long as the referenced category has previously been defined. For example:

class-drop-off: flat
L = aa, ii, ee, oo
V = a, i, e, o, L

In the example above, V has a 20% chance of being a long vowel.

You can also enclose a set of graphemes in square brackets [ and ]. This is called a 'set-category'. This set will be treated as if it were a reference to a category in terms of frequency. For example, we could write the same example as this:

class-drop-off: flat
V = a, i, e, o, [aa, ii, ee, oo]

Assigning weights to categories in categories and set-categories is possible.

Categories inside categories and set-categories CANNOT be a part of any sequence. For example, C = Xz or C = x[c, d] or C = [a, b][c, d] will not give the results you might want. To get sequence-like behaviour like that, you will need to use segments.

6 Assigning weights

If you want to set your own frequency for graphemes in a category or category-set, items in a pick-one-set, optional-set, or inter-set, or word-shapes in the words: directive, you can use a colon : to specify the weight for each item, like so:

V = a:5, e:4, i:3, o:2, u:1
$S = [V:8 x:2]
words: $S:2 y

V has approximately the following probabilities: a: 33%, e: 27%, i: 20%, o: 13%, u: 7%. The pick-one-set in the $S segment has an 80% chance of producing a V category over the x grapheme. And the first word-shape in the words: directive has twice the chance of being chosen over the next word-shape.

As you might have seen in the example above, once any option in a list of options has a weight, the weights override the drop-off frequencies for that list. Also note that any other option in that list that you have not given a weight receives a default weight of 1.
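
The arithmetic behind these figures is plain normalisation: each weight is divided by the sum of all weights, with unweighted options counting as 1. A small Python sketch, where the function name weight_probabilities is hypothetical:

```python
def weight_probabilities(items):
    """items: list of (name, weight or None); a missing weight defaults to 1."""
    weights = [(name, 1 if w is None else w) for name, w in items]
    total = sum(w for _, w in weights)
    return {name: w / total for name, w in weights}

# V = a:5, e:4, i:3, o:2, u:1
v = weight_probabilities([("a", 5), ("e", 4), ("i", 3), ("o", 2), ("u", 1)])
print({name: f"{p:.0%}" for name, p in v.items()})
# {'a': '33%', 'e': '27%', 'i': '20%', 'o': '13%', 'u': '7%'}

# words: $S:2 y   (y has no weight, so it counts as 1)
shapes = weight_probabilities([("$S", 2), ("y", None)])
```

Here shapes gives $S two thirds of the probability and y one third, matching the "twice the chance" reading above.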

7 Building words

7.1 Words

The words: directive defines a set of 'word-shapes' that Vocabug will choose from to create words. A word-shape can consist of individual graphemes, categories, segments or a mixture of these.

By default, words are selected using the Zipf distribution. The first word-shape will be chosen the most often, then the second word-shape the second most often and so on. Below is a very simple example that will generate words with one to three CV syllables:

C = t, n, k, m, l, s, r, d, h, w, b, j, p, g
V = a, i, o, e, u
words: CV, CVCV, CVCVCV
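
As a rough picture of how a Zipf-style drop-off favours the first word-shape, the Python sketch below assigns rank r the weight 1 / r and normalises. The exponent of 1 is an assumption; this documentation does not state which parameters Vocabug uses, so the printed figures are illustrative only.

```python
def zipf(n, exponent=1.0):
    """Probability of each rank 1..n under a Zipf law (the exponent is assumed)."""
    weights = [1 / rank ** exponent for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

for shape, p in zip(["CV", "CVCV", "CVCVCV"], zipf(3)):
    print(f"{shape}: {p:.1%}")
```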

7.1.1 Word-drop-off

This directive modifies how steeply the word-shapes' frequencies decrease as they go to the right, unless they have weights. The options are zipfian, gusein-zade, and flat. The default is zipfian.

It is better not to use this directive or give word-shapes weights -- it is an uphill battle. For example, if you chose to remove duplicates in the above example, that is already removing one-syllable words the most often. And if you have paragraph mode turned on, you would want simple syllables to occur very often. So it is best to simply rearrange the word-shapes in the words directive to get good-looking results. Nevertheless, maybe you want to use a flat distribution because you are only generating CVCV syllables of different types, or generating something that doesn't play by the rules.

7.2 Segments

Segments are a system that provides an abbreviation of parts of a word-shape. Typically you would use it to define the shape of a syllable. Segments are defined similarly to categories, but with several important differences:

For example you could write the last example like so:

$S = CV
words: $S $S$S $S$S$S

7.3 Pick-one-set

A pick-one-set is a group of graphemes and categories separated by spaces or commas, enclosed in square brackets [ and ]. Vocabug will pick an option from that pick-one just like it would from a segment. For example:

V = a, u
words: t[V, x]

This will produce either ta, tu or tx.

Pick-one-sets can be nested inside each other.

Anything inside the pick-one can be assigned a weight, and a pick-one itself can be assigned a weight as well if it is nested inside another set:

words: [a:1, b:2, [c, d]:2]

7.4 Optional-set

An optional-set, using round brackets ( and ), works the same way as a pick-one-set. The only difference is that what's inside it can either appear in the word or not. The probability of it appearing is 10% by default.

words: ta(n, t, l)

In the above example, there is a 10% chance of getting one of tan, tat or tal, but a 90% chance of ta.
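
The behaviour can be simulated in a few lines of Python. This is a sketch under assumptions: the helper optional_set is hypothetical, and the option inside the set is picked uniformly here, whereas Vocabug would apply its own drop-off or weights.

```python
import random

def optional_set(options, probability=0.10, rng=random):
    """Return one option with the given probability, otherwise an empty string."""
    if rng.random() < probability:
        return rng.choice(options)  # assumption: uniform pick among the options
    return ""

rng = random.Random(0)  # fixed seed so the run is repeatable
words = ["ta" + optional_set(["n", "t", "l"], rng=rng) for _ in range(10_000)]
print(words.count("ta") / len(words))  # close to 0.9
```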

7.5 Inter-set

An inter-set, using less-than and greater-than signs < and >, works the same as a pick-one-set. The difference is that only one inter-set will be chosen for that segment or word-shape.

Inter-set is a feature designed to help generate words with stress or pitch accent systems. Here is an example where it is used for a stress system:

C = t
V = a
$X = (<'>CV)<'>CV
words: $X

This produces any of the following words: 'ta, ta'ta, 'tata, and never any words with a double '. Notice here that ta is not possible -- an inter-set is only chosen after dealing with any sets that the inter-sets are nested in.

There are a few restrictions and peculiarities to it. Most notably, inter-sets may not be nested inside each other. Let's look at another example:

class-drop-off: flat
words: <a, b><x>

The above example is rather silly, as there is nothing between the inter-sets, defeating their whole purpose. However, it is useful here in showing that it is equivalent to the example below, which uses pick-one-sets instead.

class-drop-off: flat
words: [[a, b],[x]]

In both of the above examples, there is a 25% chance of producing a, a 25% chance of b, and a 50% chance of producing x.
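
The 25% / 25% / 50% split follows from each level of nesting dividing its share equally under a flat drop-off. A short Python sketch of that arithmetic, using nested lists to stand in for the nested pick-one-sets (the function name is illustrative):

```python
def flat_probabilities(tree, p=1.0, out=None):
    """tree: nested lists of graphemes; every level splits its share equally."""
    out = {} if out is None else out
    for item in tree:
        share = p / len(tree)
        if isinstance(item, list):
            flat_probabilities(item, share, out)
        else:
            out[item] = out.get(item, 0.0) + share
    return out

# words: [[a, b],[x]]
print(flat_probabilities([["a", "b"], ["x"]]))
# {'a': 0.25, 'b': 0.25, 'x': 0.5}
```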

See the "Romance-like" example for a language that uses inter-set for its stress system, or the "BTX" example for a language that uses it for a complex pitch accent system.

8 Alphabetisation and graphemes

The alphabet:, graphs:, alphabet-graphs: and invisible-alphabet: directives can be important elements of your definition-build. Let's go over their uses.

8.1 Alphabetisation

The alphabet directive gives Vocabug a custom alphabetisation order for words, when the sort words checkbox is selected.

alphabet: a, b, c, e, f, h, i, k, l, m, n, o, p, p', r, s, t, t', y

This would order generated words like so: cat chat cumin frog tray t'a
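
A custom ordering like this can be pictured as a sort key built from the position of each alphabet entry. The Python sketch below is illustrative only. It assumes that multi-character entries such as t' are matched before their shorter prefixes, and that characters missing from the alphabet (the u in cumin, the g in frog) sort after every listed entry.

```python
ALPHABET = ["a", "b", "c", "e", "f", "h", "i", "k", "l", "m",
            "n", "o", "p", "p'", "r", "s", "t", "t'", "y"]
RANK = {g: i for i, g in enumerate(ALPHABET)}
LONGEST_FIRST = sorted(ALPHABET, key=len, reverse=True)

def sort_key(word):
    """Split a word into alphabet entries (longest match first) and rank them."""
    key, i = [], 0
    while i < len(word):
        match = next((g for g in LONGEST_FIRST if word.startswith(g, i)), None)
        if match is None:
            # Assumption: unlisted characters sort after all listed entries.
            key.append((len(ALPHABET), word[i]))
            i += 1
        else:
            key.append((RANK[match], ""))
            i += len(match)
    return key

print(sorted(["t'a", "frog", "chat", "tray", "cumin", "cat"], key=sort_key))
# ['cat', 'chat', 'cumin', 'frog', 'tray', "t'a"]
```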

8.2 Defining graphs

The graphs: directive tells Vocabug which (multi)graphs, including character + combining diacritics, are to be treated as grapheme units when using transformations.

graphs: a, b, c, ch, e, f, h, i, k, l, m, n, o, p, p', r, s, t, t', y

In the above example, we defined ch as a grapheme. This stops a rule such as c -> g from changing the word chat into ghat, while cobra will still change into gobra.
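
One way to picture the effect of graphs: is to split each word into grapheme units before a rule is applied. The Python sketch below assumes a greedy longest-match split, which may differ from what Vocabug does internally; to_graphemes and apply_rule are hypothetical names.

```python
GRAPHS = ["a", "b", "c", "ch", "e", "f", "h", "i", "k", "l",
          "m", "n", "o", "p", "p'", "r", "s", "t", "t'", "y"]
LONGEST_FIRST = sorted(GRAPHS, key=len, reverse=True)

def to_graphemes(word):
    """Greedy longest-match split of a word into the declared graphs."""
    units, i = [], 0
    while i < len(word):
        # Fall back to the single character if no declared graph matches.
        match = next((g for g in LONGEST_FIRST if word.startswith(g, i)), word[i])
        units.append(match)
        i += len(match)
    return units

def apply_rule(word, target, result):
    """Replace whole grapheme units equal to target, e.g. c -> g."""
    return "".join(result if unit == target else unit for unit in to_graphemes(word))

print(to_graphemes("chat"))           # ['ch', 'a', 't']
print(apply_rule("chat", "c", "g"))   # chat
print(apply_rule("cobra", "c", "g"))  # gobra
```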

"But my list of graphemes is the same as my list of alphabeticalising letters, I don't want to list them twice", you might exclaim. Well, you can create an alphabetisation order and list your graphemes in one line using the alphabet-and-graphs: directive.

8.2.1 Alternative graphemes

The graphs: directive can tell Vocabug which character + combining diacritic sequences are to be treated as alternatives of a base grapheme. Let's name these alternatives the 'children' and the base grapheme the 'parent'. You can do this by enclosing the 'children' in <[ and ] as a set, directly after their 'parent'.

Important: The left-most precomposed character of a 'child' must be the same as its 'parent'.

This should be useful for tonal languages that mark tone with diacritics on vowels. In these tonal languages, we no longer need to list every variation of a vowel + diacritic to capture a vowel:

graphs: a, <[á, à, ā, ǎ], h, i, <[í, ì, ī, ǐ], k, l, m, n, o, <[ó, ò, ō, ǒ], t
a -> o
; mápǎ ==> mópǒ

However, we can still capture a vowel with a tone mark, such as:

ǎ -> o
; mápǎ ==> mápo

8.3 Invisibility

Sometimes you will want characters, such as syllable dividers, to be invisible to alphabetisation. You can do this by listing these characters in the invisible-alphabet: directive.

invisible-alphabet: ., ˈ

This would order generated words ˈpa.ta ˈca.ta za.ˈta ca.ˈa as ca.ˈa, ˈca.ta, ˈpa.ta, za.ˈta
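
In effect, the invisible characters are stripped before words are compared. A minimal Python sketch of that idea, assuming ordinary string comparison for the remaining characters:

```python
INVISIBLE = {".", "ˈ"}

def visible(word):
    """Drop the characters that should not take part in alphabetisation."""
    return "".join(ch for ch in word if ch not in INVISIBLE)

words = ["ˈpa.ta", "ˈca.ta", "za.ˈta", "ca.ˈa"]
print(sorted(words, key=visible))
# ['ca.ˈa', 'ˈca.ta', 'ˈpa.ta', 'za.ˈta']
```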

9 Transform

Once words are generated, you might want to modify them to prevent certain sequences, outright reject certain words, or simulate historical sound changes. This is the purpose of the transform block, which implements the NASC program.

All transformations must be used inside this block. To terminate a block you use an END line. However, all unterminated blocks are automatically terminated at the end of the definition-build:

BEGIN transform:
; Your rules go here
END

A NASC rule can be summarised in three fields: CHANGE / CONDITION ! EXCEPTION. The characters / and ! that precede each field (except for the CHANGE) are necessary for signalling each field. For example, including a ! will signal that this rule contains an exception, and all text following it until the next field marker will be interpreted as such.

Every rule begins on a new line and must contain a CHANGE. The CONDITION and EXCEPTION fields are optional.

If you want to capture graphemes that are normally syntax characters in transformations, you will need to escape them.

When this document uses examples to explain transformations, the last comment shows an example word transforming. For example, ; amda ==> ampa means the rule will transform the word amda into ampa.

10 The change

The format of the change can be expressed as TARGET -> RESULT.

Let's look at a simple unconditional rule:

; Replace every /o/ with /x/
o -> x
; bodido ==> bxdidx

In this rule, we see every instance of o become x.

10.1 Concurrent set

A concurrent set in a change is achieved by listing multiple graphemes in TARGET separated by commas in square brackets, and listing the same number of resultant graphemes in RESULT separated by commas in square brackets. Changes in a concurrent set execute at the same time:

; Switch /o/ and /a/ around
[o, a] -> [a, o]
; boda ==> bado

Notice that the above example is different to the example below:

o -> a
a -> o
; boda ==> bodo

where each change is on its own line. We can see o merge with a, then a becomes o.
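
The difference between the concurrent rule and the two sequential rules can be reproduced with ordinary string operations. This Python sketch only handles single-character graphemes and is not how Vocabug applies rules internally; the function names are illustrative.

```python
def concurrent(word, targets, results):
    """Apply all single-character replacements in one pass."""
    return word.translate(str.maketrans(dict(zip(targets, results))))

def sequential(word, rules):
    """Apply each (target, result) rule in order, one after another."""
    for target, result in rules:
        word = word.replace(target, result)
    return word

print(concurrent("boda", ["o", "a"], ["a", "o"]))    # bado
print(sequential("boda", [("o", "a"), ("a", "o")]))  # bodo
```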

In the concurrent-set example, square brackets were used, but because the entire change was a concurrent set, the square brackets are optional:

; Switch /o/ and /a/ around
o, a -> a, o
; boda ==> bado

10.4 Reject

To remove, or in other words, reject a word, you use the ~REJECT keyword in RESULT:

a, bi -> ~REJECT

In the above example, any word that contains a or bi will be rejected.
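
Conceptually, this rule acts as a filter on the generated word list. A small Python sketch with made-up sample words (the function name rejected is hypothetical):

```python
def rejected(word, targets=("a", "bi")):
    """True if the word contains any of the rejected sequences."""
    return any(t in word for t in targets)

words = ["bob", "bibi", "tuk", "kat"]
print([w for w in words if not rejected(w)])  # ['bob', 'tuk']
```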

11.3 Word boundary

# matches a word boundary: either the beginning of the word if it is before the target, or the end of the word if it is after the target.

o -> x / p_p#
; opoppop ==> opoppxp
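
For readers who know regular expressions, the rule above behaves like a substitution with a lookbehind and a lookahead. The Python sketch below is an analogy, not Vocabug's implementation; apply_final_rule is a hypothetical name.

```python
import re

# o -> x / p_p#  : an "o" preceded by "p" and followed by "p" at the end of the word.
RULE = re.compile(r"(?<=p)o(?=p$)")

def apply_final_rule(word):
    """Replace only the o that sits in p_p at the very end of the word."""
    return RULE.sub("x", word)

print(apply_final_rule("opoppop"))  # opoppxp
```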

16 Insertion and deletion

Insertion requires a condition to be present, and for a tilde ~ to be present in TARGET, representing nothing.

; insert /a/ in between /b/ and /t/
~ -> a / b_t
; bt ==> bat

Deletion happens when ~ is present in RESULT.

; delete every /b/
b -> ~
; bubda ==> uda
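
Both rules have simple regular-expression analogues. In the Python sketch below (an illustration, not Vocabug's implementation; the function names are hypothetical), insertion matches the empty position between b and t, and deletion replaces the target with nothing.

```python
import re

def insert_a(word):
    """~ -> a / b_t : insert "a" at every empty position between "b" and "t"."""
    return re.sub(r"(?<=b)(?=t)", "a", word)

def delete_b(word):
    """b -> ~ : delete every "b"."""
    return word.replace("b", "")

print(insert_a("bt"))     # bat
print(delete_b("bubda"))  # uda
```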