Vocabug
documentation

Version 0.0.0

About Vocabug
Interface
1. Options
2. File save / load
Using comments
About graphemes
1. Null grapheme
2. Escaping characters
  1. Word creation character escape
  2. Transform character escape
3. HTML Entities
Categories
1. Long form category name
2. Category-drop-off directive
3. Categories inside categories and category-sets
Assigning weights
Building words
1. Words
  1. Word-drop-off directive
2. Segments
3. Pick-one-set
4. Optional-set
  1. Optional weight
5. Inter-set
  1. Inter-set weight
Alphabetisation and graphs
1. Alphabetisation
2. Defining graphemes
  1. Alternative graphemes
3. Invisibility
Transform
The change
1. Concurrent-set
2. Merging-set
3. Optional-set
4. Reject
The condition
1. Multiple conditions in one rule
2. Optional and concurrent set
3. Word boundary
4. Syllable boundary
5. Word-based conditions
The exception
Using categories
The features directive
1. Feature-field
Wildcard, repetition and positioning
1. Wildcard
2. Ditto-mark
3. Plus-mark
4. Anythings-mark
5. Quantifier
6. Blocker
7. Positioner
Insertion and deletion
Advanced rules
1. Metathesis
Logic blocks
1. If block
2. Chance block
3. Rule macro
Cluster-field
Engine

1 About Vocabug

This is the complete documentation for Vocabug, version 0.0.0

Vocabug randomly generates vocabulary from a given definition of graphemes, frequencies and word patterns. You can use it to make words for a constructed language, to get an original nickname or password, or just for fun.

2 Interface

The textbox at the top of the program is the definition-build editor. A definition-build defines the graphemes, frequencies, word-shapes and transforms that generate the final words. The editor starts out containing either a default definition-build or the definition-build you last used to generate words
Use the Generate button to see Vocabug produce words
Use the Copy button to copy the words to the clipboard
Use the Clear button to clear the definition-build editor and the generated words

2.1 Options

Use the Number of words textbox to choose the number of words to generate. The default number is 100
Word-list mode will produce a list of words
Paragraph mode will produce words that look vaguely like sentences by injecting punctuation into the word list and capitalising the first word of each sentence
Debug mode will show, line by line, each step in creating each word
Editor wrap lines will make the definition-build editor wrap a line onto the next line when it exceeds the width of the editor
Show keyboard will reveal a 'keyboard' below the editor
Remove duplicates will make sure all words generated are unique
Force words will force the generator to try and generate the complete number of words requested within 30 seconds, despite the number of rejections / duplicates removed
Sort words and Capitalise words should be self-explanatory
The Word divider textbox sets the delimiter, or in other words, what the content will be between each word in the output. It is a space " " by default. Use \n to get one word for each line

2.2 File save / load

Use the Save button to download the definition-build as a file called 'vocabug.txt', or whatever you named your file in the File name: field. The file is always a ".txt" type
Use the Load button to load a file from your system into the definition-build editor
Use the buttons in the Examples dropdown to load a number of example definition-builds into the definition-build editor

3 Using comments

If a line contains a semicolon ; everything after it on that line is ignored and not interpreted as Vocabug syntax -- unless ; is escaped. You can use this to leave notes about what something does or why you made certain decisions.

4 About graphemes

Graphemes are the indivisible, meaningful units that make up a generated word in Vocabug. Phonemes can be thought of as graphemes. Taking the English words sky and shy as examples, sky is made up of the graphemes s + k + y, while shy is made up of sh + y.

4.1 Null grapheme

If a word is built using the syntax character ^, it will disappear in the generated word. In other words ^ is a null grapheme. If you want to use ^ as a grapheme, you will need to escape it. To use other syntax characters as graphemes, they must be escaped too.

4.2 Escaping characters

A single character that follows the syntax character \ loses any meaning it might have had in the generator. This includes backslashes themselves. This way anything can be a grapheme: capital letters that have already been defined as categories, brackets, even spaces.

4.2.1 Word creation character escape

These are the characters you must escape if you want to use them in categories, segments and the words directive:

Characters	Meaning
`;`	Comment
`{&`, `}`	HTML Entities
`C`, `D`, `Ḱ`, ...	Any one-length character can refer to a category
`{`, `}`	References a long form category name
`,`	Separates choices
` `	Space, separates choices. An alternative to commas
`$`	Defines a segment
`:`	Gives weight to a grapheme, segment, set, or word-shape
`?`	Gives probability of an optional-set being chosen
`@`	Gives weight of an inter-set being chosen over others
`[`, `]`	Pick-one-set
`(`, `)`	Optional-set
`<`, `>`	Inter-set
`\`	Escapes a character after it

4.2.2 Transform character escape

These are the characters you must escape if you want to use them in the transform block:

Characters	Meaning
`;`	Comment
`>`, `->`, `=>`, `⇒`, `→`	Indicates change
`,`	Separates choices
`[`, `]`	Concurrent-set or merging-set
`(`, `)`	Optional-set
`^REJECT`	Rejects a word
`/`	The condition follows this character
`_`	The underscore `_` is a reference to the target
`#`	Word boundary
`$`	Syllable boundary
`!`	The exception follows this character
`{`, `}`	Category or feature-matrix
`*`	Wildcard, matches exactly 1 of any character
`"`	Ditto-mark, matches exactly 1 of the previous character
`+`	Plus-mark, matches 1 or more of the previous character
`~`, `…`	Anythings-mark, matches 1 or more of any character. It is non-greedy
`=[`, `]`	Quantifier
`<[`, `]`	Blocker
`@[`, `]`	Positioner
`^`	Insertion when in `TARGET`, deletion when in `RESULT`
`\|`	Indicates metathesis, and the reordered contents
`1`, `2`, ... `9`	In a metathesis rule, in `RESULT`, these represent the changing graphemes
`\`	Escapes a character after it

4.3 HTML entities

The name or hex-code of a HTML entity, enclosed in curly brackets { and }, will be decoded at the very end of the definition-build. For example {&Agrave} will give À, and {&#x2603} will give ☃

If using the name of a HTML entity, you must preface it with &. When using the hex-code of a HTML entity, you must preface it with &#. You cannot end the name or hex-code of the entity with a semicolon.
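
As a rough illustration of this decoding step, here is a minimal Python sketch. It is not Vocabug's own code: the function name decode_entities is hypothetical, and it assumes that the closing curly bracket simply stands in for the semicolon that HTML normally requires.

```python
import html
import re

def decode_entities(text):
    """Replace {&name} and {&#xHEX} with the character they refer to."""
    def replace(match):
        # Assumption: the closing bracket plays the role of the usual ';'.
        return html.unescape("&" + match.group(1) + ";")
    return re.sub(r"\{&([^{};]+)\}", replace, text)

print(decode_entities("{&Agrave} {&#x2603}"))
```

An unknown entity name is left as plain text by html.unescape, so a typo would stay visible in the output rather than silently vanish.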

7.4.1 Optional weight

This default probability can be modified in two ways. The first is by attaching a percentage-based weight following a ? inside the optional-set:

$S = ta(n, t, l ?30)
words: $S

Now there is a 30% chance of getting one of tan, tat or tal.

7.5 Inter-set

An inter-set, written with the less-than and greater-than signs < and >, works the same as a pick-one-set. The difference is that only one inter-set will be chosen for each segment or word-shape.

Inter-set is a feature designed to help generate words with stress or pitch accent systems. Here is an example where it is used for a stress system:

C = t
V = a
$X = (<'>CV)<'>CV
words: $X

This produces any of the following words: 'ta, ta'ta or 'tata, but never a word with two ' marks. Notice that ta is not possible: the inter-set is only chosen after any sets that the inter-sets are nested in have been resolved, so exactly one ' always survives.

There are a few restrictions and peculiarities to it. Most notably, inter-sets may not be nested inside one another. Let's look at another example:

class-drop-off: flat
words: <a, b><x>

The above example is rather silly, as there is nothing between each Inter-set, defeating its whole purpose. However it is useful as an example here in showing that it is equivalent to the example below, which uses pick-ones instead.

class-drop-off: flat
words: [[a, b],[x]]

In both of the above examples, there is a 25% chance of producing a, a 25% chance of b, and a 50% chance of producing x.
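
The 25% / 25% / 50% split can be checked with a few lines of Python. This is only a sketch of the arithmetic under a flat distribution, with nested lists standing in for nested pick-one-sets; flat_probabilities is a made-up helper, not part of Vocabug.

```python
from fractions import Fraction

def flat_probabilities(choices, share=Fraction(1)):
    """Split `share` evenly among the items, recursing into nested lists."""
    result = {}
    part = share / len(choices)
    for item in choices:
        if isinstance(item, list):
            # A nested set receives one share, then divides it among its items.
            for key, value in flat_probabilities(item, part).items():
                result[key] = result.get(key, 0) + value
        else:
            result[item] = result.get(item, 0) + part
    return result

probs = flat_probabilities([["a", "b"], ["x"]])
print(probs)
```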

See the "Romance-like" example for a language that uses inter-set for its stress system, or the "BTX" example for a language that uses it for a complex pitch accent system.

7.5.1 Inter-set weight

Inter-set weights begin with an @ inside the set. The weight behaves like colon weights rather than percentage-based weights. Let's look at a scenario and break it down:

class-drop-off: flat
$Y = <a:2, b:1 @3><c>
words: <$Y @8>-<d @2>

In the segment $Y, a and b together have three times the chance of being chosen over c, while a has a weight that makes it twice as probable as b. In the words: directive, there is one word-shape, and that word-shape has an 80% chance of being the segment $Y followed by -, and a 20% chance of being -d.

8.2.1 Alternative graphemes

The graphs: directive can tell Vocabug which character + combining diacritic sequences are to be treated as alternatives of a base grapheme. Let's name these alternatives the 'children' and the base grapheme the 'parent'. You can do this by enclosing the 'children' in <[ and ] as a set, directly after their 'parent'.

Important: The left-most precomposed character of a 'child' must be the same as its 'parent'.

This should be useful for tonal languages that mark tone with diacritics on vowels. In these tonal languages, we no longer need to list every variation of a vowel + diacritic to capture a vowel:

  graphs: a, <[á, à, ā, ǎ], h, i, <[í, ì, ī, ǐ], k, l, m, n, o, <[ó, ò, ō, ǒ], t
  a -> o
; mápǎ ==> mópǒ

However we can still capture a vowel with a tone mark, such as ǎ:

  ǎ -> o
; mápǎ ==> mápo

8.3 Invisibility

Sometimes you will want characters, such as syllable dividers, to be invisible to alphabetisation. You can do this by listing these characters in the invisible-alphabet: directive.

invisible-alphabet: ., ˈ

This would order generated words ˈpa.ta ˈca.ta za.ˈta ca.ˈa as ca.ˈa, ˈca.ta, ˈpa.ta, za.ˈta
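
One way to picture this is a sort that strips the invisible characters before comparing words. The Python sketch below assumes exactly that and nothing more; sort_ignoring is a hypothetical name, and ordinary code point order is used for the remaining letters.

```python
def sort_ignoring(words, invisible):
    """Sort words while treating the `invisible` characters as absent."""
    def key(word):
        return "".join(ch for ch in word if ch not in invisible)
    return sorted(words, key=key)

words = ["ˈpa.ta", "ˈca.ta", "za.ˈta", "ca.ˈa"]
ordered = sort_ignoring(words, {".", "ˈ"})
print(ordered)
```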

6 Categories

A category is a set of graphemes with a key. The key is a single capital letter. For example:

C = t, n, k, m, ch, l, ꞌ, s, r, d, h, w, b, y, p, g
F = n, l, ꞌ, t, k, r, p
V = a, i, e, u, o

This creates three groups of graphemes. C is the group of all consonants, V is the group of all vowels, and F is the group of some of the consonants that will be used syllable-finally.

These graphemes are separated by commas. An alternative is to use spaces: C = t n k m ch l ꞌ s r d h w b y p g.

By default, the graphemes' frequencies decrease as they go to the right, according to the Gusein-Zade distribution. In the above example, when Vocabug needs to choose a V, it will choose a the most at 43%, i the second-most at 26%, e the third-most at 17%, u the fourth-most at 10%, and o the fifth most at 4%.
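
The percentages quoted above can be reproduced with a short calculation. The sketch below assumes the usual Gusein-Zade form, where the item at rank r out of n gets a weight proportional to ln((n + 1) / r); whether Vocabug computes it exactly this way is an assumption, but the rounded results match the figures given for V.

```python
import math

def gusein_zade(n):
    """Return normalised probabilities for ranks 1..n."""
    weights = [math.log((n + 1) / rank) for rank in range(1, n + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]

percentages = [round(p * 100) for p in gusein_zade(5)]
print(dict(zip("aieuo", percentages)))
```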

Á Ć É Ǵ Í Ḱ Ĺ Ḿ Ń Ó Ṕ Ŕ Ś Ú Ẃ Ý Ź

ẞ Ä Ë Ḧ Ï Ö Ü Ẅ Ẍ Ÿ

Γ Δ Θ Λ Ξ Π Σ Φ Ψ Ω

6.1 Categories inside categories and set-categories

You can use categories inside categories, as long as the referenced category has previously been defined. For example:

default-category-distribution: flat
L = aa, ii, ee, oo
V = a, i, e, o, L

In the example above, V has a 20% chance of being a long vowel.

You can also enclose a set of graphemes in square brackets [ and ]. This is called a 'set-category'. This set will be treated as if it were a reference to a category in terms of frequency. For example, we could write the same example as this:

default-category-distribution: flat
V = a, i, e, o, [aa, ii, ee, oo]

Assigning weights to categories in categories and set-categories is possible.

Categories inside categories and set-categories CANNOT be a part of any sequence. For example, C = Xz or C = x[c, d] or C = [a, b][c, d] will not give the results you might want. To get sequence-like behaviour like that, you will need to use segments.

7 Building words

7.1 Words

The words: directive defines a set of 'word-shapes' that Vocabug will choose from to create words. A word-shape can consist of individual graphemes, categories, segments or a mixture of these.

By default, words are selected using the Zipf distribution. The first word-shape will be chosen the most often, then the second word-shape the second most often and so on. Below is a very simple example that will generate words with one to three CV syllables:

C = t, n, k, m, l, s, r, d, h, w, b, j, p, g
V = a, i, o, e, u
words: CV, CVCV, CVCVCV

7.2 Segments

Segments are a system that provides an abbreviation of parts of a word-shape. Typically you would use it to define the shape of a syllable. Segments are defined similarly to categories, but with several important differences:

Every segment's key starts with $. S = s is a category; $S = s is a segment.
Segments are not sets like categories are. $M = a, b, c will not work as you might expect (because, as already stated, segments are abbreviations for parts of word-shapes). You would need to use a pick-one-set, i.e. $M = [a, b, c]

For example, you could write the previous example like so:

$S = CV
words: $S $S$S $S$S$S

You can put segments inside segments.

7.3 Pick-one-set

A pick-one-set is a group of graphemes and categories separated by spaces or commas, enclosed in square brackets [ and ]. Vocabug will pick an option from that pick-one-set just like it would from a category. For example:

V = a, u
words: t[V, x]

This will produce either ta, tu or tx.

Pick-one-sets can be nested inside each other.

Anything inside the pick-one can be assigned a weight, and a pick-one itself can be assigned a weight as well if it is nested inside another set:

words: [a:1, b:2, [c, d]:2]

7.4 Optional-set

An optional-set, written with round brackets ( and ), works the same way as a pick-one-set. The only difference is that its contents can either appear in the word or not. The probability of the contents appearing is 10% by default.

words: ta(n, t, l)

In the above example, there is a 10% chance of getting one of tan, tat or tal, but a 90% chance of ta.

7.4.1 Optionals weight

By default, an optional-set has a 10% chance of being included in the word. You can change this probability.

4 Default distributions

The ordering of items matters in categories, segments and word-shapes. The first item will be chosen the most often, the second item the second most often, and so on.

You can change these default distributions (another name for this might be "default drop-off", but I digress). For categories, the default is gusein-zade, and for the separate setting for word-shapes, the default is zipfian. The distribution will be applied to each item in a set, and then recursively to any set that set is nested in (treating the nested set as an item), then applied at the surface level.

A zipfian distribution approximates natural language frequency for words, where the highest-ranked item receives the greatest weight, and subsequent ones decay steeply until flattening out.
A gusein-zade distribution offers a gentler slope that is natural across phonemes in a language, following a logarithmic decay that still prioritizes top-ranked items but spreads weight more evenly
A shallow distribution is the red-headed step-child of the distributions. It doesn't occur in natural language, but it offers us something between flat and gusein-zade. It is Zipfian in nature, a 'long-tailed Zipfian distribution'
A flat distribution treats all items equally. This is not to say the items will be evenly chosen -- items are still chosen at random on each generation, they just have the same weight

5 Assigning weights

If you want to set your own frequency for graphemes in a category or category-set, items in a pick-one-set, or optional-set, or word-shapes in the words: directive, you can use a colon : to specify the weight for each item, like so:

V = a:5, e:4, i:3, o:2, u:1
$S = [V:8 x:2]
words: $S:2 y

V has approximately the following probabilities: a: 33%, e: 27%, i: 20%, o: 13%, u: 7%. The pick-one-set in the $S segment has an 80% chance of producing a V category over the x grapheme. And the first word-shape in the words: directive has twice the chance of being chosen over the next word-shape.
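
The arithmetic behind these figures is simply each weight divided by the sum of the weights in its set. A minimal Python sketch, with to_percentages as a hypothetical helper name:

```python
def to_percentages(weights):
    """Turn a mapping of item -> weight into rounded percentages."""
    total = sum(weights.values())
    return {item: round(100 * weight / total) for item, weight in weights.items()}

vowels = to_percentages({"a": 5, "e": 4, "i": 3, "o": 2, "u": 1})
pick_one = to_percentages({"V": 8, "x": 2})
shapes = to_percentages({"$S": 2, "y": 1})  # y is unweighted, so it counts as 1
print(vowels, pick_one, shapes)
```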

As you might have noticed in the example above, when a set has at least one weighted option, the weights override any default distribution. Also important to note is that any other option in that set (or on the surface level) that you have not given a weight is given a weight of 1.

8 Alphabetisation

The alphabet: directive gives Vocabug a custom alphabetisation order for words when the Sort words checkbox is selected.

alphabet: a, b, c, e, f, h, i, k, l, m, n, o, p, p', r, s, t, t', y

This would order generated words like so: cat chat cumin frog tray t'a yanny
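
Here is a hedged Python sketch of how such an ordering could work. It assumes that multigraphs like t' are matched longest-first and that characters missing from the alphabet (u and g here) sort after every listed grapheme; both are assumptions for illustration, and sort_key is a hypothetical name.

```python
ALPHABET = ["a", "b", "c", "e", "f", "h", "i", "k", "l", "m",
            "n", "o", "p", "p'", "r", "s", "t", "t'", "y"]

def sort_key(word, alphabet=ALPHABET):
    """Translate a word into a list of alphabet positions."""
    rank = {grapheme: i for i, grapheme in enumerate(alphabet)}
    longest_first = sorted(alphabet, key=len, reverse=True)
    key, pos = [], 0
    while pos < len(word):
        for grapheme in longest_first:
            if word.startswith(grapheme, pos):
                key.append(rank[grapheme])
                pos += len(grapheme)
                break
        else:
            # Assumption: unlisted characters rank after the whole alphabet.
            key.append(len(alphabet))
            pos += 1
    return key

words = ["yanny", "t'a", "cat", "tray", "frog", "chat", "cumin"]
ordered = sorted(words, key=sort_key)
print(ordered)
```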

9 Transform

Once words are generated, you might want to modify them to prevent certain sequences, outright reject certain words, or simulate historical sound changes. This is the purpose of the transform block, which implements the NASC program.

All transforms must be used inside this block. To terminate a block you use an END line. However, all unterminated blocks are automatically terminated at the end of the definition-build:

BEGIN transform:
; Your rules go here
END

A NASC rule can be summarised in three fields: CHANGE / CONDITION ! EXCEPTION. The characters / and ! that precede the CONDITION and EXCEPTION fields are necessary for signalling each field. For example, including a ! will signal that this rule contains an exception, and all text following it until the next field marker will be interpreted as such.

Every rule begins on a new line and must contain a CHANGE. The CONDITION and EXCEPTION fields are optional.

If you want to capture graphemes that are normally syntax characters in transforms, you will need to escape them.

When this document uses examples to explain transformations, the last comment shows an example word transforming. For example ; amda ==> ampa means the rule will transform the word amda into ampa

9.1 Defining graphemes

The graphemes: directive tells Vocabug which (multi)graphs, including character + combining diacritics, are to be treated as grapheme units when using transformations.

graphemes: a, b, c, ch, e, f, h, i, k, l, m, n, o, p, p', r, s, t, t', y

In the above example, we defined ch as a grapheme. This would stop a rule such as c -> g changing the word chat into ghat, but it will make cobra change into gobra.
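
The effect can be sketched in Python by splitting a word into grapheme units before applying a rule. This is an illustration only: it assumes longest-match splitting, and split_graphemes and replace_grapheme are hypothetical names rather than anything Vocabug exposes.

```python
GRAPHEMES = ["a", "b", "c", "ch", "e", "f", "h", "i", "k", "l", "m",
             "n", "o", "p", "p'", "r", "s", "t", "t'", "y"]

def split_graphemes(word, graphemes=GRAPHEMES):
    """Split a word into units, preferring the longest listed grapheme."""
    longest_first = sorted(graphemes, key=len, reverse=True)
    units, pos = [], 0
    while pos < len(word):
        unit = next((g for g in longest_first if word.startswith(g, pos)), word[pos])
        units.append(unit)
        pos += len(unit)
    return units

def replace_grapheme(word, target, result):
    """Apply an unconditional TARGET -> RESULT change unit by unit."""
    return "".join(result if unit == target else unit for unit in split_graphemes(word))

print(replace_grapheme("chat", "c", "g"), replace_grapheme("cobra", "c", "g"))
```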

10 The change

The format of the change can be expressed as TARGET -> RESULT.

TARGET specifies which part of the word is being changed
TARGET is then followed by a space and the > character. > can be swapped with ->, =>, ⇒ or → if you prefer
RESULT is what TARGET is changing into, or in other words, replacing

Let's look at a simple unconditional rule:

; Replace every /o/ with /x/
  o -> x
; bodido ==> bxdidx

In this rule, we see every instance of o become x.

10.1 Concurrent set

A concurrent set in a change is achieved by listing multiple graphemes in TARGET separated by commas in square brackets, and listing the same number of resultant graphemes in RESULT separated by commas in square brackets. Changes in a concurrent change execute at the same time:

; Switch /o/ and /a/ around
  [o, a] -> [a, o]
; boda ==> bado

Notice that the above example is different to the example below:

  o -> a
  a -> o
; boda ==> bodo

where each change is on its own line. We can see o merge with a, then a becomes o.
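
The difference between the two can be sketched in Python. The helpers below are hypothetical and only handle single-character graphemes: concurrent looks every grapheme up in one mapping, while sequential applies one replacement after another.

```python
def concurrent(word, mapping):
    """Apply every change at the same time."""
    return "".join(mapping.get(ch, ch) for ch in word)

def sequential(word, rules):
    """Apply each change in order, feeding the output into the next rule."""
    for target, result in rules:
        word = word.replace(target, result)
    return word

print(concurrent("boda", {"o": "a", "a": "o"}))
print(sequential("boda", [("o", "a"), ("a", "o")]))
```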

In the above example, square brackets were used, but because the entire rule was a concurrent set, the square brackets are optional:

; Switch /o/ and /a/ around
  o, a -> a, o
; boda ==> bado

10.2 Merging set

A merging change is accomplished by placing graphemes enclosed in square brackets in TARGET, with a corresponding singular grapheme in RESULT that the graphemes in the set will merge into:

; Three graphemes becoming two graphemes
  [ʃ, z], dz -> s, d
; zeʃadzas ==> sesadas

10.3 Optional-set

Items in an optional-set can be captured whether or not they appear as part of a grapheme or as part of a sequence of graphemes:

; Merge /x/ and /xw/ into /k/
  x(w) -> k
; xwaxaħa ==> kakaħa

Optional-set can also attach to a concurrent or merging change:

; Merge /x/, /xw/, /ħ/ and /ħw/ into /k/
  [x, ħ](w) -> k
; xwaxaħa ==> kakaka

Looking at the above example, let's say you wanted to preserve this optional /w/ following /x/ or /ħ/. We can do this by writing this /w/ in RESULT, enclosed by round brackets:

; Like the previous rule, but preserve labialisation
  [x, ħ](w) -> k(w)
; xwaxaħa ==> kwakaka

The optional-set can also be a merging change, or concurrent change too:

; Like the previous rule, but preserve palatalisation and labialisation 
  [x ħ](w, j) -> k(w, j)
; xwaxjaxa ==> kwakjaka
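
For readers who know regular expressions, these optional-set rules behave much like an optional group. The Python lines below are a loose analogy, not Vocabug's implementation: a character class stands in for the set, ? marks the optional item, and a backreference carries the optional item over into the result.

```python
import re

# Merge /x/, /xw/, /ħ/ and /ħw/ into /k/.
merged = re.sub(r"[xħ]w?", "k", "xwaxaħa")

# Same, but keep the optional /w/ when it was present.
kept_w = re.sub(r"[xħ](w?)", r"k\1", "xwaxaħa")

# Keep either /w/ or /j/ when present.
kept_wj = re.sub(r"[xħ]([wj]?)", r"k\1", "xwaxjaxa")

print(merged, kept_w, kept_wj)
```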

10.4 Reject

To remove, or in other words, reject a word, you use the ^REJECT keyword in RESULT:

a, bi -> ^REJECT

In the above example, any word that contains a or bi will be rejected.

11 The condition

Conditions follow the change and are placed after a forward slash. The condition may also be called the environment.

The format of a condition is / BEFORE_AFTER

BEFORE is anything in the word before the target
The underscore _ is a reference to the target in a condition
AFTER is anything in the word after the target

For example:

; Change /o/ into /x/ only when it is between /p/s
  o -> x / p_p
; opoptot ==> opxptot

11.1 Multiple conditions in one rule

Multiple conditions for a single rule can be made by separating each condition with additional forward slashes. The change will happen if it meets either, or both of the conditions:

; Change /o/ into /x/ only when it is between /p/s or /t/s
  o -> x / p_p / t_t
; opoptot ==> opxptxt
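
A condition can be pictured as a pair of regular-expression lookarounds: BEFORE becomes a lookbehind and AFTER a lookahead, and several conditions become alternatives. The Python sketch below takes that view; apply_rule is a hypothetical helper that only handles literal graphemes.

```python
import re

def apply_rule(word, target, result, conditions):
    """Replace target with result wherever any (before, after) pair matches."""
    alternatives = []
    for before, after in conditions:
        alternatives.append(
            "(?<=" + re.escape(before) + ")"
            + re.escape(target)
            + "(?=" + re.escape(after) + ")"
        )
    return re.sub("|".join(alternatives), result, word)

print(apply_rule("opoptot", "o", "x", [("p", "p")]))
print(apply_rule("opoptot", "o", "x", [("p", "p"), ("t", "t")]))
```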

11.2 Optional and concurrent sets

Optional and concurrent sets can be used in conditions:

  o -> x / k(w)_[p, s]
; kwop-po-kos-po ==> kwxp-po-kxs-po

11.3 Word boundary

# matches a word boundary: the beginning of the word when it is placed in BEFORE, or the end of the word when it is placed in AFTER

  o -> x / p_p#
; opoppop ==> opoppxp

11.4 Syllable boundary

$ matches to syllable boundaries. A syllable boundary is either the beginning or end of the word, or any of the symbols defined in the syllable-boundary: directive.

For example:

  syllable-boundary: .
  t$t -> d$d
; at.ta ==> ad.da

11.5 Word-based condition

If we wanted to execute a transformation only on a list of words, we simply write those words as a list in a condition without any underscores:

sw -> s / _o / swore, sworn

In the above example, the transformation will only execute if the word is swore or sworn

12 The exception

Exceptions are placed following a ! and go after the condition, if there is one. Exceptions function exactly like the opposite of the condition -- they will make sure the content in the exception does not execute a change:

sw -> s / _o ! swore, sworn

In the above example, the transformation will not execute if the word is swore or sworn

13 Using categories

You can reference categories in transforms by enclosing a category in curly brackets { and }. The category will behave in the same way as a concurrent or merging set:

  B = x, y, z
  BEGIN transform:
  {B} -> ^
; xapay ==> apa

14 The features directive

Let's say you had the grapheme, or rather, phoneme /i/ and wanted to capture it by its distinctive vowel features, +high and +front, and turn it into a phoneme marked with +high and +back features, perhaps /ɯ/. The features: directive block lets you do this:

Features are defined inside the features block. The features block begins with BEGIN features: and terminates with END
A feature prepended with a plus sign + is a 'pro-feature'. For example +voice. In the features block, we can define a set of graphemes that are marked by this feature by using this pro-feature. For example: +voice = b, d, g, v, z
A feature prepended with a minus sign - is an 'anti-feature'. For example -voice. In the features block, we can define a set of graphemes that are marked by a lack of this feature by using this anti-feature. For example: -voice = p, t, k, f, s
Where does this leave graphemes that are not marked by either the pro-feature or the anti-feature of a feature? Such graphemes are unmarked by that feature.
To capture graphemes that are marked by features in a transform, the features must be listed in a 'feature-matrix' using curly brackets { and }. The graphemes in a word must be marked by each pro-/anti-feature in the feature-matrix to be captured. For example if a feature-matrix {+high, +back} captures the graphemes: u, ɯ, another feature-matrix {+high, +back, -round} would capture ɯ only.

The very simple example below is written to change all voiceless graphemes that have a voiced counterpart into their voiced counterparts:

BEGIN features:
  -voice = p, t, k, f, s
  +voice = b, d, g, v, z
END

  {-voice} -> {+voice}
; tamefa ==> dameva

In this rule, in RESULT, {+voice} has a symmetrical one-to-one change of graphemes from the graphemes in {-voice} in TARGET, leading to a concurrent change. Let's quickly imagine a scenario where the only {+voice} grapheme was b. The result will be a merging of all -voice graphemes into b: tamepfa ==> bamebba. Similarly, in a different scenario where the only -voice grapheme was p, p would become the first grapheme in {+voice}, which happens to be b: tamepfa ==> tamebfa
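
The three scenarios in this paragraph can be sketched in Python by pairing the graphemes of the two features by position. The helper shift_feature is hypothetical, handles single-character graphemes only, and assumes that surplus TARGET graphemes merge into the last available RESULT grapheme.

```python
def shift_feature(word, source, target):
    """Map each grapheme in `source` to the grapheme at the same index in `target`."""
    mapping = {}
    for index, grapheme in enumerate(source):
        # Assumption: when `target` is shorter, the extra graphemes merge
        # into its last grapheme.
        mapping[grapheme] = target[min(index, len(target) - 1)]
    return "".join(mapping.get(ch, ch) for ch in word)

voiceless = ["p", "t", "k", "f", "s"]
voiced = ["b", "d", "g", "v", "z"]
print(shift_feature("tamefa", voiceless, voiced))
print(shift_feature("tamepfa", voiceless, ["b"]))
print(shift_feature("tamepfa", ["p"], voiced))
```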

Para-feature

A feature defined without a prepended plus or minus sign is a 'para-feature'. A para-feature is a pro-feature without a listed anti-feature counterpart. Instead, the graphemes marked as the anti-feature are the graphemes in the graphs: directive that are not marked by the para-feature.

Notice: If there is no graphs: directive in the definition-build, there will be zero anti-feature phonemes. If you define an anti-feature as the counterpart of a para-feature, your anti-feature will be ignored.

graphs: a, b, h, i, k, n, o, t

BEGIN features:
  vowel = a, i, o
END

In the above example, the matrix {-vowel} captures the graphemes b, h, k, n, t

Combining features

We can 'combine' features. Or to be more accurate, a feature's graphemes can mirror the graphemes of other features by defining a feature with features in it. The combined features must be a pro-feature or anti-feature:

BEGIN features:
  labial = p, b, m
  alveolar = t, d, s, l, n
  palatal = j
  velar = k, g
  glottal = h
  consonant = +labial, +alveolar, +palatal, +velar, +glottal
END

14.1 Feature-field

Feature-fields allow graphemes to be easily marked by multiple features at the same time.

The feature-field begins with a % followed by a para-feature. Think of this para-feature as the parent feature of the other features in that feature-field. The graphemes marked by this para-feature are listed in the first row. The graphemes marked by the anti-feature counterpart are the graphemes in the graphs: directive that are not marked by the para-feature.
The graphemes being marked by the features are listed on the first row
The features are listed in the first column
A + means to mark the grapheme by that feature's pro-feature
A - means to mark the grapheme by that feature's anti-feature
A . means to leave the grapheme unmarked by that feature

Here is an example of comprehensive features of consonants and vowels:

graphs: a, e, i, o, p, b, t, d, k, g, s, h, l, j, m, n
BEGIN features:
  %consonant m n p b t d k g s h l j
  voice      + + - + - + - + - - + +
  plosive    - - + + + + + + - - - -
  nasal      + + - - - - - - - - - -
  fricative  - - - - - - - - + + - -
  approx     - - - - - - - - - - + +
  labial     + - + + - - - - - - - -
  alveolar   - + - - + + - - + - + -
  palatal    - - - - - - - - - - - +
  velar      - - - - - - + + - - - -
  glottal    - - - - - - - - - + - -

  %vowel a e i o
  high   - - + -
  mid    - + - +
  low    + - - -
  front  - + + -
  back   + - - +
  round  - - - +
END

Here are some matrices of these features and which graphemes they would capture:

{+plosive} captures the graphemes b, d, g, p, t, k
{+voice, +plosive} captures the graphemes b, d, g
{+voice, +labial, +plosive} captures the grapheme b
{+vowel} captures the graphemes a, e, i, o
{-vowel} captures the graphemes p, b, t, d, k, g, s, h, l, j, m, n
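
A feature-matrix can be thought of as a filter that keeps only the graphs marked by every listed feature. The Python sketch below uses a few features from the example above; capture and the PRO table are hypothetical, and as a simplification a - feature is read as "not marked +", which is how para-features such as vowel behave.

```python
GRAPHS = ["a", "e", "i", "o", "p", "b", "t", "d", "k", "g", "s", "h", "l", "j", "m", "n"]

# Graphemes marked + for each feature, copied from the feature-field example.
PRO = {
    "voice": {"m", "n", "b", "d", "g", "l", "j"},
    "plosive": {"p", "b", "t", "d", "k", "g"},
    "nasal": {"m", "n"},
    "vowel": {"a", "e", "i", "o"},
}

def capture(matrix):
    """Return the graphs that satisfy every signed feature in the matrix."""
    kept = []
    for graph in GRAPHS:
        # Simplification: '-' means "not in the + set" for that feature.
        if all((feature[0] == "+") == (graph in PRO[feature[1:]]) for feature in matrix):
            kept.append(graph)
    return kept

print(capture(["+voice", "+plosive"]))
print(capture(["-vowel"]))
```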

Notice a problem that could occur with the above example? The above example has no overlapping features between consonants and vowels, which is fine. But the example below describes a language that has overlapping features between vowels and consonants, namely, syllabic consonants that carry tone. The solution here is to list all phonemes in just one feature-field:

BEGIN features:
  %phoneme   m n p b t d k g s h l j n̩ ń̩ ǹ̩ a á à e é è i í ì o ó ò
  syllabic   - - - - - - - - - - - - + + + + + + + + + + + + + + +
  vowel      - - - - - - - - - - - - - - - + + + + + + + + + + + +
  high       . . . . . . . . . . . . . . . - - - - - - + + + - - - 
  mid        . . . . . . . . . . . . . . . - - - + + + - - - + + +
  low        . . . . . . . . . . . . . . . + + + - - - - - - - - -
  front      . . . . . . . . . . . . . . . - - - + + + + + + - - - 
  back       . . . . . . . . . . . . . . . + + + - - - - - - + + +
  round      . . . . . . . . . . . . . . . - - - - - - - - - + + +
  low_tone   . . . . . . . . . . . . . . - - - + - - + - - + - - +
  mid_tone   . . . . . . . . . . . . + - - + - - + - - + - - + - -
  high_tone  . . . . . . . . . . . . . . + - + - - + - - + - - + -
  consonant  + + + + + + + + + + + + + + + - - - - - - - - - - - -
  voice      + + - + - + - + - - + + + + + + + + + + + + + + + + +
  plosive    - - + + + + + + - - - - - - . . . . . . . . . . . . .
  nasal      + + - - - - - - - - - - + + . . . . . . . . . . . . .
  fricative  - - - - - - - - + + - - - - . . . . . . . . . . . . .
  approx     - - - - - - - - - - + + - - . . . . . . . . . . . . .
  labial     + - + + - - - - + + - - + - . . . . . . . . . . . . .
  alveolar   - + - - + + - - - - + - - + . . . . . . . . . . . . .
  palatal    - - - - - - - - - - - + - - . . . . . . . . . . . . .
  velar      - - - - - - + + - - - - - - . . . . . . . . . . . . .
  glottal    - - - - - - - - - + - - - - . . . . . . . . . . . . .
END

15 Wildcard, repetition and positioning

Wildcards and the like in this section are special tokens that can represent arbitrary amounts of arbitrary graphemes, which is especially useful when you don't know precisely how many, or of what kind of grapheme there will be between two target graphemes in a word.

15.1 Wildcard

A wildcard, written with an asterisk *, matches any one grapheme. A wildcard does not match word boundaries and cannot be used in RESULT:

  a -> e / _*
; apappap ==> epeppep

Wildcard can be placed by itself inside an optional-set (*), thereby allowing it to match nothing as well.

15.2 Ditto-mark

The ditto-mark, written with a double-quote ", matches one repetition of the grapheme (or the grapheme from a set, category, or feature) to its left. In other words, you can use the ditto-mark to capture an item only when it is geminated:

  a" -> o
; aaata ==> oata

Ditto-mark can be placed by itself inside an optional-set ("), thereby allowing it to match zero copies as well.

15.3 Plus-mark

The plus-mark, written with +, matches as many repetitions as possible (but at least one) of the grapheme, or grapheme in a set, category, or feature, to its left. Like the ditto-mark, it captures the item together with its repetitions, so a lone item is not matched:

  a+ -> o
; raraaaaa ==> raro

  [p,t,k]+ -> [b,d,g]
; atppakkka ==> atbaga

Plus-mark can be placed by itself inside an optional-set (+), thereby allowing it to match zero copies as well.

You may want to match zero, once, or as many times as possible to the grapheme, or grapheme in a set, category, or feature, to the left of it, known occasionally as a Kleene-star. To do this, you need to enclose the grapheme, set, category, or feature and the plus-mark in an optional-set, and the plus-mark by itself inside another optional-set:

; /i/ followed by either 0, 1, or more /a/ becomes /e/
  i(a(+)) -> e
; ririaaaaa ==> rere
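
For readers who know regular expressions, the ditto-mark and plus-mark examples above behave roughly like the TypeScript patterns below. This is only an analogy inferred from the example outputs in this section, not a description of how Vocabug works internally, and the function names are hypothetical.

```typescript
// Regex analogies for the rule examples above (assumed equivalents only).

// a" -> o : an /a/ followed by one copy of itself.
function dittoExample(word: string): string {
  return word.replace(/aa/g, "o");
}

// a+ -> o : an /a/ followed by one or more copies of itself.
function plusExample(word: string): string {
  return word.replace(/a{2,}/g, "o");
}

// [p,t,k]+ -> [b,d,g] : a repeated stop becomes the matching voiced stop.
function plusSetExample(word: string): string {
  const voiced: Record<string, string> = { p: "b", t: "d", k: "g" };
  return word.replace(
    /([ptk])\1+/g,
    (_match: string, stop: string) => voiced[stop] ?? stop
  );
}

// i(a(+)) -> e : /i/ followed by zero, one, or more /a/.
function kleeneExample(word: string): string {
  return word.replace(/ia*/g, "e");
}

console.log(dittoExample("aaata"));       // oata
console.log(plusExample("raraaaaa"));     // raro
console.log(plusSetExample("atppakkka")); // atbaga
console.log(kleeneExample("ririaaaaa"));  // rere
```

The back-reference \1 in plusSetExample plays the part of the plus-mark applied to a set: the repeated grapheme must be the same one that was first matched.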

15.4Anythings-mark

The anythings-mark uses a tilde ~ or the ellipsis character … (U+2026). It matches one or more graphemes of any kind, taking only as many as it needs. For example:

  b~t -> x
; babãitto ==> xto

As we can see, the rule matched b followed by anything else until it reached the first t, then stopped matching. Why did the anythings-mark not continue matching t and beyond like *+ would? This is because it is non-greedy, or in other words, lazy. The anythings-mark keeps matching graphemes only until it reaches a grapheme that matches the item following the anythings-mark.
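
The difference between lazy and greedy matching can be seen with ordinary regular expressions. The sketch below is an analogy only: it assumes that the lazy quantifier .+? behaves like the anythings-mark and that the greedy .+ behaves like *+, and the function names are made up for illustration.

```typescript
// Lazy: stop at the first "t" after "b" (mirrors b~t -> x).
function lazyMatch(word: string): string {
  return word.replace(/b.+?t/, "x");
}

// Greedy: run on to the last "t" in the word (what a *+ style match would do).
function greedyMatch(word: string): string {
  return word.replace(/b.+t/, "x");
}

console.log(lazyMatch("babãitto"));   // xto
console.log(greedyMatch("babãitto")); // xo
```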

The example below uses an optional anythings-mark in the condition:

; Simulate spreading of nasality to vowels
  [a, i, u] -> [ã, ĩ, ũ] / [ã, ĩ, ũ](…)_
; babãittati ==> babãĩttãtĩ

15.5Quantifier

The quantifier, whose number(s) are enclosed in =[ and ], matches the item to its left repeated that many times.

; Change /o/ into /x/ only when preceded by three /r/s
  o -> x / r=[3]_
; ororrro ==> ororrrx

The digits in the quantifier can also be a list:

; Change a sequence of 2 or 4 /o/s into /x/
  o=[2, 4] -> x
; toootoooo ==> txotx

The numbers in the quantifier can also be a range of numbers. To do this, put a : between the lowest and highest number. (The numbers must be in order from lowest to highest):

; Change a sequence of 2 to 4 /o/s into /x/
  o=[2:4] -> x
; toootoooo ==> txtx

Using the + symbol

If you use a + in a list between a lower and a higher number, it represents all the numbers between the two:

; Change a sequence of 2 to 10 /o/s into /x/
  o=[2, +, 10] -> x
; toootoooo ==> txtx

At the beginning of the list, + represents all the possible numbers lower than the number to the right, not including zero.

; Change a sequence of 1 to 10 /o/s into /x/
  o=[+, 10] -> x
; toootoooo ==> txtx

And finally, at the end of the list, + represents all possible numbers larger than the number to the left:

; Change a sequence of 4 or more /o/s into /x/
  o=[4, +] -> x
; toootooooo ==> toootx
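
To make the list, range and + forms concrete, here is a minimal TypeScript sketch that expands the contents of a quantifier into the set of repetition counts it allows. The function expandQuantifier and its cap parameter are hypothetical: the cap stands in for "as many as possible" so that an open-ended + yields a finite list. This is not how Vocabug itself parses quantifiers.

```typescript
// Expand quantifier contents such as "2, 4", "2:4" or "2, +, 10"
// into a sorted list of allowed repetition counts.
function expandQuantifier(spec: string, cap: number): number[] {
  const tokens = spec.split(",").map((t) => t.trim());
  const counts = new Set<number>();
  const addRange = (lo: number, hi: number): void => {
    for (let n = lo; n <= hi; n++) counts.add(n);
  };
  tokens.forEach((token, i) => {
    if (token === "+") {
      const prev = tokens[i - 1];
      const next = tokens[i + 1];
      // At the start of the list, + counts up from 1 (zero is excluded).
      const lo = prev === undefined ? 1 : Number(prev.split(":").pop());
      // At the end of the list, + runs up to the cap.
      const hi = next === undefined ? cap : Number(next.split(":")[0]);
      addRange(lo, hi);
    } else {
      // A plain number "3" or a range "2:4".
      const lo = Number(token.split(":")[0]);
      const hi = Number(token.split(":").pop());
      addRange(lo, hi);
    }
  });
  return Array.from(counts).sort((a, b) => a - b);
}

console.log(expandQuantifier("2, 4", 20));  // [ 2, 4 ]
console.log(expandQuantifier("2:4", 20));   // [ 2, 3, 4 ]
console.log(expandQuantifier("+, 3", 20));  // [ 1, 2, 3 ]
console.log(expandQuantifier("4, +", 6));   // [ 4, 5, 6 ]
```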

Here is a useful lookup table on getting quantities of ditto-marks or wildcards:

	Wildcard	Ditto-mark
Exactly 1 of	`*`	`"`
0 or 1 of	`(*)`	`(")`
1 or more of	`~`	`+`
0, 1, or more of	`(~)`	`(+)`
Specific number(s) of	`*=[N]`	`"=[N]`
Number range(s) of	`*=[N:N]`	`"=[N:N]`

15.6Blocker

Blocker stops the spreading behaviour of the anythings-mark. You enclose a set of graphemes inside <[ and ], and the anythings-mark will not spread past any of them. For example, we might want the graphemes k or g to prevent the rightward spread of nasality from nasal vowels to non-nasal vowels:

  [a, i, u] -> [ã, ĩ, ũ] / [ã, ĩ, ũ](~)<[k, g]_
; pabãdruliga ==> pabãdrũlĩga
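
The spreading and blocking behaviour can be pictured as a single left-to-right pass over the word. The TypeScript sketch below is an illustration under stated assumptions, not Vocabug's implementation: the function spreadNasality is hypothetical, the vowels are assumed to be precomposed characters, and a nasal vowel after a blocker is assumed to start a new spread.

```typescript
// Oral vowel -> nasal counterpart (precomposed ã, ĩ, ũ).
const NASAL_OF: Record<string, string> = {
  a: "\u00E3",
  i: "\u0129",
  u: "\u0169",
};

function spreadNasality(word: string, blockers: string[] = []): string {
  const nasalVowels = Object.keys(NASAL_OF).map((v) => NASAL_OF[v]);
  let spreading = false;
  let out = "";
  for (const ch of Array.from(word)) {
    const nasal = NASAL_OF[ch];
    if (blockers.indexOf(ch) !== -1) {
      spreading = false; // a blocker stops the spread
      out += ch;
    } else if (nasalVowels.indexOf(ch) !== -1) {
      spreading = true; // a nasal vowel starts the spread
      out += ch;
    } else if (spreading && nasal !== undefined) {
      out += nasal; // an oral vowel inside the spread becomes nasal
    } else {
      out += ch;
    }
  }
  return out;
}

// Without blockers, nasality spreads to the end of the word.
console.log(spreadNasality("bab\u00E3ittati"));
// With k and g as blockers, the final /a/ after /g/ stays oral.
console.log(spreadNasality("pab\u00E3druliga", ["k", "g"]));
```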

15.7Positioner

A positioner, enclosed in @[ and ], allows the grapheme to the left of it to be captured only when it is the Nth occurrence of that grapheme in the word:

; Change the second /o/ in a word to /x/ after the second /s/
  o@[2] -> x / s@[2]_
; sososo ==> sosxso

If we want to match the last occurrence of a grapheme in a word, use -1. For the second-to-last occurrence of a grapheme in a word, use -2, and so forth:

; Change the last /o/ in a word to /x/
  o@[-1] -> x
; sososo ==> sososx

The numbers in the positioner can also be a list of numbers:

; Change the first and third /o/ in a word to /x/
  o@[1, 3] -> x
; sososo ==> sxsosx

The numbers in the positioner can also be a range. To do this, put a : between the lowest and highest number:

; Change the first to third /o/ in a word to /x/
  o@[1:3] -> x
; sososoo ==> sxsxsxo
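
The way positive and negative positions select occurrences can be sketched in TypeScript as follows. The helper replaceAtPositions is hypothetical and handles only a single-grapheme target with no condition; it is meant to show the counting, not to reproduce Vocabug's matcher.

```typescript
// Replace the occurrences of `target` named by `positions`.
// Positive positions count from the start (1 = first occurrence);
// negative positions count from the end (-1 = last occurrence).
function replaceAtPositions(
  word: string,
  target: string,
  positions: number[],
  replacement: string
): string {
  const chars = Array.from(word);
  // Indexes in `chars` where the target occurs, in order.
  const hits: number[] = [];
  chars.forEach((ch, i) => {
    if (ch === target) hits.push(i);
  });
  for (const pos of positions) {
    const hitIndex = pos > 0 ? pos - 1 : hits.length + pos;
    const charIndex = hits[hitIndex];
    // Positions that name no occurrence are silently ignored.
    if (charIndex !== undefined) chars[charIndex] = replacement;
  }
  return chars.join("");
}

console.log(replaceAtPositions("sososo", "o", [-1], "x"));       // sososx
console.log(replaceAtPositions("sososo", "o", [1, 3], "x"));     // sxsosx
console.log(replaceAtPositions("sososoo", "o", [1, 2, 3], "x")); // sxsxsxo
```

A range such as @[1:3] corresponds here to passing the list [1, 2, 3].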

16Insertion and deletion

Insertion requires a condition to be present, and a caret ^, representing nothing, to be present in TARGET:

; insert /a/ in between /b/ and /t/
  ^ -> a / b_t
; bt ==> bat

Deletion happens when ^ is present in RESULT:

; delete every /b/
  b -> ^
; bubda ==> uda

17Advanced rules

17.1Metathesis

Metathesis in NASC refers to the reordering of graphemes in a word. Metathesis in real-world diachronics is usually sporadic, but can be regular.

To make a rule a metathesis rule, use these symbols:

The pipe | marks the content (if any) between the targets we want to reorder. You must use the same number of |s in TARGET as in RESULT.
Numbers in RESULT refer to the targets. Reordering these numbers reorders the targets. It is possible to have up to nine targets.
Underscores _ in a condition or exception are references to the targets. Unlike a normal rule, a metathesis rule can have multiple underscores.

Local metathesis

A typical type of metathesis is local two-place metathesis:

; An intervocalic stop + nasal sequence becomes nasal + stop
  [stop]|[nasal] -> 2|1 / V__V 
; watna ==> wanta

Long-distance metathesis

The example below approximates metathesis that occurred in Spanish:

r|l -> 2|1 / _(…)[plosive]_
; parabla ==> palabra

One-place metathesis

To simulate one-place metathesis, move the |s.

The example below is a metathesis where words beginning with stop + vowel try to move the r of a later stop + r cluster forward, forming a word-initial stop + r cluster:

{stop}|r -> 12| / #_{vowel}…{stop}_ 
; kabatros ==> krabatos

Metathesis madness

Three or more items, to a maximum of 9, switching places, are possible, with shuffling of any |:

  x|y|z -> ||321
; xaayooz ==> aaoozyx
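
The three kinds of metathesis above can be imitated with capture groups in regular expressions. The TypeScript sketch below is an analogy, not Vocabug's engine: the function names are hypothetical, and the contents of the stop, nasal and vowel categories are assumed purely for illustration.

```typescript
// Assumed category contents: stops = p t k b d g, nasals = m n, vowels = a e i o u.

// Local: an intervocalic stop + nasal sequence becomes nasal + stop.
function localMetathesis(word: string): string {
  return word.replace(/([aeiou])([ptkbdg])([mn])([aeiou])/g, "$1$3$2$4");
}

// Long-distance: r ... plosive + l swaps the r and the l.
function longDistanceMetathesis(word: string): string {
  return word.replace(/r(.*?[ptkbdg])l/, "l$1r");
}

// One-place: an r after a later stop moves next to the word-initial stop.
function onePlaceMetathesis(word: string): string {
  return word.replace(/^([ptkbdg])([aeiou].*?[ptkbdg])r/, "$1r$2");
}

console.log(localMetathesis("watna"));          // wanta
console.log(longDistanceMetathesis("parabla")); // palabra
console.log(onePlaceMetathesis("kabatros"));    // krabatos
```

In each replacement string, the order of the $ references plays the same part as the reordered numbers in RESULT.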

18Logic blocks

Logic blocks are a way of executing transformations depending on a trigger event that we are listening for.

18.1If block

Using an if block, you can make transformations execute on a word if, or if not, other transformation(s) were applied to the word.

It should feel familiar to anyone who knows a bit about programming languages.

BEGIN if: starts the if block. The transforms placed here are listened to, and trigger the transforms in then: or else: depending on whether or not they were applied to the word.
then: is where you put transforms that will execute if the transformations in if: did apply.
else: is where you put transforms that will execute if the transformations in if: did not meet a CONDITION or were blocked by an EXCEPTION.
END is the end of the block.

For example:

BEGIN if:
  ; Deletion of schwa before r
  ə -> ^ / _r
then:
  ; Then do metathesis of r and l
  r|l -> 2|1 / _|[plosive]_
else:
  ; Schwa becomes e if the first rule did not apply
  ə -> e
END

Note: The above example is actually quite bogus if it were a historical sound change. Sound change in natural diachronics has no memory. We can have "two-part" sound-changes such as this triggered metathesis, but a sound change executing on a word because another sound change did not apply to the word does not occur, at least not in real-life natural human languages.
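
For readers who think in code, the control flow of an if block can be sketched in TypeScript as below. Everything here is hypothetical: the names ifBlock and IfRule are invented, rules are modelled as plain string functions, and "did apply" is assumed to mean that the word came out changed, which may not be exactly how Vocabug decides it.

```typescript
type IfRule = (word: string) => string;

function applyRules(word: string, rules: IfRule[]): string {
  return rules.reduce((current, rule) => rule(current), word);
}

function ifBlock(
  word: string,
  ifRules: IfRule[],
  thenRules: IfRule[],
  elseRules: IfRule[]
): string {
  const afterIf = applyRules(word, ifRules);
  // Assumption: a transform "applied" if it changed the word.
  const applied = afterIf !== word;
  return applyRules(afterIf, applied ? thenRules : elseRules);
}

// Regex stand-ins for the three rules of the example (\u0259 is schwa).
const deleteSchwaBeforeR: IfRule = (w) => w.replace(/\u0259(?=r)/g, "");
const swapRAndL: IfRule = (w) => w.replace(/r(.*?[ptkbdg])l/, "l$1r");
const schwaToE: IfRule = (w) => w.replace(/\u0259/g, "e");

function runIfExample(word: string): string {
  return ifBlock(word, [deleteSchwaBeforeR], [swapRAndL], [schwaToE]);
}

console.log(runIfExample("p\u0259rabla")); // plabra (schwa deleted, then metathesis)
console.log(runIfExample("p\u0259ta"));    // peta   (schwa kept, so else: applies)
```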

18.2Chance block

The chance block is a way to apply transformations depending on percentage-based chance:

BEGIN chance 15:
  a -> e
END

In the above example, a word with an a in it, such as pa, has a 15% chance of becoming pe.
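
A percentage-based chance like this can be sketched in TypeScript as follows. The function chanceBlock is hypothetical, and it assumes the chance is rolled once per word; the random source is passed in as a parameter so that the behaviour can be checked deterministically.

```typescript
// Apply `rule` to `word` with the given percentage chance.
// `random` must return a number in [0, 1); it defaults to Math.random.
function chanceBlock(
  word: string,
  percent: number,
  rule: (w: string) => string,
  random: () => number = Math.random
): string {
  return random() * 100 < percent ? rule(word) : word;
}

const aToE = (w: string): string => w.replace(/a/g, "e");

// With a fixed random source the outcome is predictable:
console.log(chanceBlock("pa", 15, aToE, () => 0.1)); // pe (10 is below 15)
console.log(chanceBlock("pa", 15, aToE, () => 0.5)); // pa (50 is not below 15)
```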

18.3Rule macro

A rule macro saves rules to be used later in the definition-build, as many times as needed. The rules inside the def-rule-macro block do not run until they are invoked with the do-rule-macro: directive, as below:

BEGIN def-rule-macro resyllabify:
  i -> j / _[a,e,o,u]
  u -> w / _[a,e,i,o]
END

  do-rule-macro: resyllabify
  ʔ -> ^
  do-rule-macro: resyllabify
; iaruʔitua ==> jaruʔitwa ==> jaruitwa ==> jarwitwa

In the above example we saved two rules as a macro under the name "resyllabify" and used that macro twice.
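
The idea of storing rules under a name and replaying them later can be sketched in TypeScript as below. The names defRuleMacro and doRuleMacro only mirror the directives for readability; the registry, the function signatures and the regex stand-ins for the rules are all assumptions made for this illustration.

```typescript
type MacroRule = (word: string) => string;

// Registry of saved macros, keyed by name.
const ruleMacros: Record<string, MacroRule[]> = {};

function defRuleMacro(name: string, rules: MacroRule[]): void {
  ruleMacros[name] = rules; // saved, but not run yet
}

function doRuleMacro(name: string, word: string): string {
  const rules = ruleMacros[name];
  if (rules === undefined) throw new Error("Unknown rule macro: " + name);
  return rules.reduce((current, rule) => rule(current), word);
}

defRuleMacro("resyllabify", [
  (w) => w.replace(/i(?=[aeou])/g, "j"), // i -> j / _[a,e,o,u]
  (w) => w.replace(/u(?=[aeio])/g, "w"), // u -> w / _[a,e,i,o]
]);

function runMacroExample(word: string): string {
  let w = doRuleMacro("resyllabify", word);
  w = w.replace(/\u0294/g, ""); // delete the glottal stop
  return doRuleMacro("resyllabify", w);
}

console.log(runMacroExample("iaru\u0294itua")); // jarwitwa
```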

19Cluster-field

Cluster-fields are a way to target and change sequences of graphemes. They are laid out like tables, and start with %. For example:

% a  i  u
a +  +  o
i -  +  uu
u -  -  +

The first grapheme is the row, and the second grapheme is the column. In this example, au becomes o and iu becomes uu. + means to leave the combination as-is, and - means to reject the word. This table would permit ai but reject ia.

Cluster-fields can also use ^ in them to remove a sequence.

As with filters, these are parsed in the order presented. The cluster-field ends at a blank line or the end of the definition-build.
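
As a rough mental model, a cluster-field is a two-level lookup: row grapheme first, column grapheme second. The TypeScript sketch below encodes the table above and applies it to a word. It is hypothetical throughout: the function applyClusterField is invented, a rejected word is represented by null, and the scan is assumed to run left to right, skipping past a pair once it has been replaced.

```typescript
// Row grapheme -> column grapheme -> cell.
// "+" keeps the pair, "-" rejects the word, "^" removes the pair,
// and anything else is the replacement.
const clusterField: Record<string, Record<string, string>> = {
  a: { a: "+", i: "+", u: "o" },
  i: { a: "-", i: "+", u: "uu" },
  u: { a: "-", i: "-", u: "+" },
};

function applyClusterField(word: string): string | null {
  const chars = Array.from(word);
  let out = "";
  let i = 0;
  while (i < chars.length) {
    const first = chars[i];
    const second = chars[i + 1];
    const row = first !== undefined ? clusterField[first] : undefined;
    const cell = row !== undefined && second !== undefined ? row[second] : undefined;
    if (cell === undefined || cell === "+") {
      out += first; // pair not in the table, or left as-is
      i += 1;
    } else if (cell === "-") {
      return null; // the word is rejected
    } else {
      out += cell === "^" ? "" : cell; // remove or replace the pair
      i += 2;
    }
  }
  return out;
}

console.log(applyClusterField("pau")); // po
console.log(applyClusterField("piu")); // puu
console.log(applyClusterField("ai"));  // ai
console.log(applyClusterField("ia"));  // null (rejected)
```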

20Engine

The engine statement provides useful functions that you can call at any point in the definition-build. You can also call a list of these functions in one line, for example: engine: compose, capitalise

decompose will break down all characters in a word into their "Unicode Normalization, Canonical Decomposition" form. For example, ñ as a single Unicode character, \u00F1, will be broken down into a sequence of two characters, n \u006E + ◌̃ \u0303. The corresponding TypeScript call is normalize("NFD")
compose does the opposite of decompose. It converts all characters in a word to the "Unicode Normalization, Canonical Decomposition followed by Canonical Composition" form. For example, ñ as two characters, \u006E\u0303, will be transformed into one character, \u00F1. The corresponding TypeScript call is normalize("NFC")
capitalise will convert the first character of a word to uppercase
de-capitalise will convert the first character of a word to lowercase
to-upper-case will convert all characters of a word to uppercase
to-lower-case will convert all characters of a word to lowercase
xsampa_to_ipa will convert graphemes of a word written in X-SAMPA into IPA
ipa_to_xsampa will convert graphemes of a word written in IPA into X-SAMPA
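
The normalisation and capitalisation functions map directly onto standard string methods. The TypeScript sketch below shows what they amount to; the wrapper functions and applyEngine are hypothetical names for illustration, and only normalize, toUpperCase, charAt and slice are real built-in methods.

```typescript
function decompose(word: string): string {
  return word.normalize("NFD");
}

function compose(word: string): string {
  return word.normalize("NFC");
}

function capitalise(word: string): string {
  return word.charAt(0).toUpperCase() + word.slice(1);
}

// Run a list of functions in order, like "engine: compose, capitalise".
function applyEngine(word: string, steps: Array<(w: string) => string>): string {
  return steps.reduce((current, step) => step(current), word);
}

console.log(decompose("\u00F1").length); // 2 (n + combining tilde)
console.log(compose("n\u0303").length);  // 1 (single precomposed character)
console.log(applyEngine("n\u0303andu", [compose, capitalise])); // Ñandu
```

Composing before capitalising matters for a decomposed word: capitalise only touches the first code unit, so on its own it would uppercase the n and leave the combining tilde behind it, whereas after compose the first character is the whole ñ.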
