Lexifer documentation
Contents
- About Lexifer
- Interface
- On using comments
- The with: directive
- On defining frequency, phonology, and word creation
- On filtering and rejecting words
1 About Lexifer
This is the complete documentation for Lexifer version b2.0.1.
Lexifer Online is an online application that randomly generates words from a given definition of phonemes, frequencies and word patterns. Applications like Lexifer are called "word generators" or "vocabulary generators".
You can use it to make words for a constructed language, to get an original nickname or password, or just for fun.
This version of Lexifer is a fork of a TypeScript version of Lexifer written by u/bbrk24. Software Copyright (c) 2021-2022 bbrk24. Copyright (c) 2006-2023 William S. Annis.
2 Interface
- Use the Examples dropdown button to load a number of example definitions into the file editor
- The phonology definition file editor is the main input. It defines the phonology and the word shapes you get from the word generator. There will already be a default phonology definition in the file editor, or your previous phonology definition that you generated words with
- Use the Generate button to see Lexifer produce words
- Use the Copy button to copy the words to your clipboard
2.1 Options
- Use the Number of words textbox to choose the number of words to generate. The default number is 100
- Word-list mode will produce a list of words
- Paragraph mode will produce words that look vaguely like sentences by injecting punctuation into the word list and capitalising the first word of each sentence
- Debug mode will show, line by line, each step in creating each word
- Editor wrap lines will make the file editor wrap a line onto the next line if it exceeds the width of the file editor
- Remove duplicates will make sure all words generated are unique
- Force word limit will force the generator to try to generate the complete number of words requested within 30 seconds, despite the number of rejections / duplicates removed
- Sort words and Capitalise words should be self-explanatory
- The Word divider textbox sets the delimiter, or in other words, what the content will be between each word in the output. It is a space by default. Use \n to get one word for each line
2.3 File save / load
- Use the Save button to download your phonology definition as a file called 'lexifer.txt', or what you named your file in the File name: field. The file is always a ".txt" type
- Use the Load button to load a file on your system into the file editor
3 Comments
If a line contains a #, everything after it on that line is ignored. You can use this to leave notes about what something does or why you made certain decisions.
4 The with: directive
The first line of the default definition starts with with:. The with: directive defines a featureset and engines.
4.1 Featuresets
If you have a with: statement, you must use exactly one featureset. Currently, there are two options: std-ipa-features and std-digraph-features. The former is IPA, and the latter is ASCII-friendly. The recognised consonants are as follows:
IPA | Digraph | Features |
---|---|---|
p | p | voiceless bilabial plosive |
b | b | voiced bilabial plosive |
ɸ | ph | voiceless bilabial fricative |
β | bh | voiced bilabial fricative |
f | f | voiceless labiodental fricative |
v | v | voiced labiodental fricative |
m | m | voiced labial¹ nasal |
t | t | voiceless alveolar plosive |
d | d | voiced alveolar plosive |
s | s | voiceless alveolar sibilant |
z | z | voiced alveolar sibilant |
θ | th | voiceless alveolar² fricative |
ð | dh | voiced alveolar² fricative |
ɬ | lh | voiceless alveolar lateral fricative |
ɮ | ldh | voiced alveolar lateral fricative |
tɬ | tl | voiceless alveolar lateral affricate |
dɮ | dl | voiced alveolar lateral affricate |
ts | ts | voiceless alveolar affricate |
dz | dz | voiced alveolar affricate |
ʃ | sh | voiceless postalveolar sibilant |
ʒ | zh | voiced postalveolar sibilant |
tʃ | ch | voiceless postalveolar affricate |
dʒ | j | voiced postalveolar affricate |
n | n | voiced alveolar nasal |
ʈ | rt | voiceless retroflex plosive |
ɖ | rd | voiced retroflex plosive |
ʂ | sr | voiceless retroflex sibilant |
ʐ | zr | voiced retroflex sibilant |
ʈʂ | rts | voiceless retroflex affricate |
ɖʐ | rdz | voiced retroflex affricate |
ɳ | rn | voiced retroflex nasal |
c | ky | voiceless palatal plosive |
ɟ | gy | voiced palatal plosive |
ɕ | sy | voiceless palatal sibilant |
ʑ | zy | voiced palatal sibilant |
ç | hy | voiceless palatal fricative |
ʝ | yy | voiced palatal fricative |
tɕ | cy | voiceless palatal affricate |
dʑ | jy | voiced palatal affricate |
ɲ | ny | voiced palatal nasal |
k | k | voiceless velar plosive |
g | g | voiced velar plosive |
x | kh | voiceless velar fricative |
ɣ | gh | voiced velar fricative |
ŋ | ng | voiced velar nasal |
q | q | voiceless uvular plosive |
ɢ | gq | voiced uvular plosive |
χ | qh | voiceless uvular fricative |
ʁ | gqh | voiced uvular fricative |
ɴ | nq | voiced uvular nasal |
¹ These are both bilabial and labiodental. For example, the assimilations engine turns nf into mf and nɸ into mɸ, even though f and ɸ have different places of articulation.
² Yes, the IPA describes these as dental. However, the IPA does not make the dental/alveolar distinction elsewhere, so it is simpler to say that these are alveolar.
Choosing a specific featureset does not mean you have to use it for everything. Rather, you only need to use it for the consonants that will be considered by the engines you use (see below). Any unrecognised segments will be ignored.
4.2 Engines
Engines are applied after word generation and before any user-defined filters.
std-assimilations
This engine has two behaviours.
The first affects all consonants for which both voiced and voiceless versions exist. It applies leftward assimilation of voicing. For example, it would turn akda into agda and abta into apta.
The second only changes nasals, but considers all consonants except for approximants, lateral approximants, and trills. It applies leftward assimilation of place of articulation. For example, it would turn amta into anta and anka into aŋka.
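As a rough illustration of the two behaviours, here is a minimal Python sketch. The feature tables cover only a handful of IPA segments, and both the tables and the function name `assimilate` are hypothetical; this is not Lexifer's actual implementation. It also assumes that a nasal followed by another nasal is left alone, which the description above does not settle.

```python
# Hypothetical, heavily reduced feature tables (IPA only); not Lexifer's real data.
TO_VOICED = {"p": "b", "t": "d", "k": "g", "f": "v", "s": "z"}   # keys are voiceless
TO_VOICELESS = {v: k for k, v in TO_VOICED.items()}              # keys are voiced
PLACE = {
    "p": "labial", "b": "labial", "f": "labial", "v": "labial", "m": "labial",
    "t": "alveolar", "d": "alveolar", "s": "alveolar", "z": "alveolar", "n": "alveolar",
    "k": "velar", "g": "velar", "ŋ": "velar",
}
NASAL_AT = {"labial": "m", "alveolar": "n", "velar": "ŋ"}
NASALS = set(NASAL_AT.values())


def assimilate(word: str) -> str:
    segs = list(word)  # one character per segment in this sketch
    # Walk right to left so a change can feed the segment to its left.
    for i in range(len(segs) - 2, -1, -1):
        cur, nxt = segs[i], segs[i + 1]
        if cur in NASALS and nxt in PLACE and nxt not in NASALS:
            segs[i] = NASAL_AT[PLACE[nxt]]   # nasal copies the next consonant's place
        elif cur in TO_VOICED and nxt in TO_VOICELESS:
            segs[i] = TO_VOICED[cur]         # voiceless before voiced becomes voiced
        elif cur in TO_VOICELESS and nxt in TO_VOICED:
            segs[i] = TO_VOICELESS[cur]      # voiced before voiceless becomes voiceless
    return "".join(segs)


for w in ["akda", "abta", "amta", "anka", "anfa"]:
    print(w, "->", assimilate(w))
```

Treating f as "labial" alongside m mirrors footnote ¹, where nf becomes mf.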
coronal-metathesis
This engine only affects bilabial, alveolar, and velar plosives and nasals. It ensures that clusters of these segments have the alveolar element last. For example, it would turn atka into akta and anma into amna. It does not metathesise a nasal with a plosive; anpa would not become apna.
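The rule can be sketched in a few lines of Python. The tables and the function name `coronal_metathesis` are hypothetical and only cover the IPA segments named above; Lexifer's own code may differ.

```python
# Hypothetical tables; plosives and nasals are kept apart so they never swap with each other.
PLOSIVES = {"p": "bilabial", "b": "bilabial", "t": "alveolar",
            "d": "alveolar", "k": "velar", "g": "velar"}
NASALS = {"m": "bilabial", "n": "alveolar", "ŋ": "velar"}


def coronal_metathesis(word: str) -> str:
    segs = list(word)
    for i in range(len(segs) - 1):
        a, b = segs[i], segs[i + 1]
        for table in (PLOSIVES, NASALS):
            # Swap only when both segments are in the same table and the
            # alveolar one currently comes first.
            if (a in table and b in table
                    and table[a] == "alveolar" and table[b] != "alveolar"):
                segs[i], segs[i + 1] = b, a
    return "".join(segs)


for w in ["atka", "anma", "anpa"]:
    print(w, "->", coronal_metathesis(w))
```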
5 On defining frequency, phonology, and word creation
This is the main purpose of the word generator. It shows how words are initially generated before being modified by any filters and rejections.
5.1 Alphabetisation – the letters: directive
If you have a with: directive, there must also be letters:. If not, letters: is optional. letters: tells Lexifer what symbols you use and how to alphabetise them. It also affects how digraphs are parsed, even if std-ipa-features was chosen. For example, consider the following statements:
    with: std-ipa-features
    letters: t ʃ
In this case, if tʃ occurs, it will not be treated as an affricate tʃ, but as a plosive t followed by a sibilant ʃ. Additionally, words starting with t will be sorted alphabetically above words starting with ʃ. Contrast this with the following statements:
    with: std-ipa-features
    letters: tʃ t ʃ
In this case, tʃ is treated as an affricate. Additionally, words starting with tʃ will be sorted above words starting with tt, even though t by itself comes before ʃ.
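One way to picture both effects is a longest-match tokeniser plus a sort key built from the order of the letters. The Python below is only a sketch: `tokenize` and `sort_key` are hypothetical names, and placing undeclared symbols after all declared letters is an assumption.

```python
def tokenize(word: str, letters: list[str]) -> list[str]:
    # Try longer letters first so a declared "tʃ" wins over "t".
    by_length = sorted(letters, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        seg = next((l for l in by_length if word.startswith(l, i)), word[i])
        out.append(seg)
        i += len(seg)
    return out


def sort_key(word: str, letters: list[str]) -> list[int]:
    rank = {letter: n for n, letter in enumerate(letters)}
    # Assumption: symbols missing from letters sort after every declared letter.
    return [rank.get(seg, len(letters)) for seg in tokenize(word, letters)]


plain = ["t", "ʃ", "a"]
with_affricate = ["tʃ", "t", "ʃ", "a"]

print(tokenize("tʃa", plain))            # ['t', 'ʃ', 'a']
print(tokenize("tʃa", with_affricate))   # ['tʃ', 'a']
print(sorted(["tta", "tʃa"], key=lambda w: sort_key(w, plain)))           # ['tta', 'tʃa']
print(sorted(["tta", "tʃa"], key=lambda w: sort_key(w, with_affricate)))  # ['tʃa', 'tta']
```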
5.2 Phoneme classes
These are groupings of phonemes that have one-letter names. For example, here are the classes from the default definition:
    C = t n k m ch l ꞌ s r d h w b y p g
    D = n l ꞌ t k r p
    V = a i e á u o
This creates three groupings. C is the group of all consonants, V is the group of all vowels, and D is a group of some of the consonants. A class cannot contain another class; this is not legal:
    C = D m ch s d h w b y g
If you do this, and you have a letters: directive, Lexifer will warn you:
    A phoneme class contains 'D' missing from 'letters'. Strange word shapes are likely to result.
By default, the phonemes' frequencies decrease as they go to the right, according to the Gusein-Zade distribution. In the above example, when Lexifer needs to choose a C, it will choose t the most, n the second-most, k the third-most, and so on. If you are not satisfied with the frequencies, you can use a colon (:) to specify the weight for each phoneme, like so:
    V = a e i o u
    # V has approximately the following probabilities:
    # a: 43%, e: 26%, i: 17%, o: 10%, u: 4%
    U = a:5 e:4 i:3 o:2 u:1
    # U has approximately the following probabilities:
    # a: 33%, e: 27%, i: 20%, o: 13%, u: 7%
Weights are relative, so a:5 e:4 i:3 o:2 u:1 is the same as a:50 e:40 i:30 o:20 u:10. Changing the order or weights of phonemes is a good way to change the feel of the language without changing the phonotactics.
If you specify a weight for any phoneme in a class, you must specify the weight for all of them. If you specify a weight of 0, the phoneme will never be selected.
Weights can be fractions, for example: C = t:2.5 k:1 n:0.75
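The percentages quoted for V and U can be reproduced with a short calculation. The Python sketch below assumes the default weight of the phoneme at rank r out of n is proportional to ln(n + 1) - ln(r). That formula is inferred from the numbers given above rather than taken from Lexifer's source, and the function names are hypothetical.

```python
import math


def gusein_zade(n: int) -> list[float]:
    # Assumed form: weight of rank r (1-based) is ln(n + 1) - ln(r).
    weights = [math.log(n + 1) - math.log(r) for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def normalise(weights: list[float]) -> list[float]:
    # Explicit weights are relative, so only their ratios matter.
    total = sum(weights)
    return [w / total for w in weights]


print([round(p * 100) for p in gusein_zade(5)])              # [43, 26, 17, 10, 4]
print([round(p * 100) for p in normalise([5, 4, 3, 2, 1])])  # [33, 27, 20, 13, 7]
```

Because only ratios matter, `normalise([50, 40, 30, 20, 10])` gives the same probabilities as `normalise([5, 4, 3, 2, 1])`.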
5.3 Macros
Macros are a system designed to provide an abbreviation for syllable shapes. They are defined similarly to phoneme classes, but with several important differences:
- Every macro's name starts with $. S = s is a phoneme class; $S = s is a macro.
- Macros allow phoneme classes inside of them. C = D is not valid, but $C = D works as expected.
- Macros do not support multiple possibilities. $M = a b c will not work the way you may think.
The default definition has one macro:
    $S = CVD?
    words: V?$S$S V?$S V?$S$S$S
This is exactly equivalent to the following definition:
    words: V?CVD?CVD? V?CVD? V?CVD?CVD?CVD?
However, since most syllables are CVD?, it is quicker to use a macro.
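The equivalence above amounts to plain text substitution, which can be sketched in Python as follows. `expand_macros` is a hypothetical helper, not part of Lexifer.

```python
def expand_macros(pattern: str, macros: dict[str, str]) -> str:
    # Replace every macro name (e.g. "$S") with its body, verbatim.
    for name, body in macros.items():
        pattern = pattern.replace(name, body)
    return pattern


print(expand_macros("V?$S$S V?$S V?$S$S$S", {"$S": "CVD?"}))
# V?CVD?CVD? V?CVD? V?CVD?CVD?CVD?
```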
5.4 The random-rate: directive
The random-rate: directive specifies how often optional phonemes or classes are selected. This number is a percentage. For example,
    random-rate: 25
    words: CVD?
is equivalent to
    words: CV:75 CVD:25
The default random-rate is 10%.
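To see where the 75 and 25 come from, the following Python sketch expands a shape with optional elements into its concrete shapes and their probabilities. It assumes every optional element is kept independently with probability random-rate; `expand_optionals` is a hypothetical helper.

```python
from itertools import product


def expand_optionals(shape: str, random_rate: float) -> dict[str, float]:
    """Map each concrete shape to its probability in percent."""
    parts = []  # (symbol, is_optional) pairs
    for ch in shape:
        if ch == "?":
            parts[-1] = (parts[-1][0], True)  # "?" marks the previous symbol
        else:
            parts.append((ch, False))
    rate = random_rate / 100
    n_optional = sum(1 for _, optional in parts if optional)
    result: dict[str, float] = {}
    # Enumerate every keep/drop combination of the optional symbols.
    for keep in product([True, False], repeat=n_optional):
        flags = iter(keep)
        out, prob = "", 1.0
        for symbol, optional in parts:
            if not optional:
                out += symbol
            elif next(flags):
                out += symbol
                prob *= rate
            else:
                prob *= 1 - rate
        result[out] = result.get(out, 0.0) + prob * 100
    return result


print(expand_optionals("CVD?", 25))  # {'CVD': 25.0, 'CV': 75.0}
```

Under the same assumption, a shape with two optional elements such as V?CVD? at the default rate of 10% yields CV roughly 81% of the time.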
5.5 Building words
The most common way to make a word is to use the words: directive. Words are weighted similarly to how phonemes are weighted in classes.
A word can consist of individual phonemes, phoneme classes, or a mixture of both.
Phonemes or classes that are optional can be indicated by a ?. For example, words: CVD? is similar to words: CV CVD, although the weights are quite different.
If you choose from the same class twice in a row, you may put an ! after the second one, to indicate they must not be the same phoneme. For example, CC may generate tt, but CC! never will.
By default, words are selected using the Zipf distribution.
5.6 Categories
The categories: directive is an alternative to words:. You may not include both directives in the same definition.
categories: lets you define multiple types of words. The general syntax is:
    categories: cat1 cat2 # ...etc
    cat1 = # word shapes for cat1
    cat2 = # word shapes for cat2
The categories themselves can also be weighted, but these weights only apply in paragraph mode. If you give a number of words, that is the number of words generated per category. This is where a weight of 0 could be helpful. If you want to generate parts of a word when you enter a number, but only show complete words in paragraph mode, you could have something like:
    categories: root:0 prefix:0 suffix:0 full-word:1
    # ...definitions of each category...
The order that the categories are declared is the order they are presented when generating a specific number of words.
6 Filters and rejections
Filters and rejections modify or remove words generated from the words: or categories: directive. They are executed in the order they are written.
6.1 Filters
Filters are a way to change words after they have been generated and run through the engines in the with: directive. If your spelling doesn't match up with a featureset exactly, you can use filters to achieve this.
Filters are expressed as filter: pattern > replacement. For example, if you want to spell [ŋ] the same as [n], you would say:
    filter: ŋ > n
Multiple filters on one line are separated by semicolons:
    filter: pattern1 > replacement1; pattern2 > replacement2
This does not mean that the two filters are run at the same time. It is identical to:
    filter: pattern1 > replacement1
    filter: pattern2 > replacement2
If the replacement is !, the pattern is removed from the word, but the rest of the word is left alone.
6.2 Rejections
To outright forbid a sequence from occurring, use the reject: directive. The default definition contains a few of these. The first two are:
    reject: wu yi
This prevents any word from having wu or yi. In reality, reject: is an abbreviation, and that statement is equivalent to:
    filter: wu > REJECT; yi > REJECT
As such, you can intersperse filters and rejections, and they will be performed in order.
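The ordering behaviour can be sketched in Python as a single pass over (pattern, replacement) pairs. `apply_rules` is a hypothetical helper, and Python's `re` module stands in here for the regular-expression engine Lexifer actually uses.

```python
import re
from typing import Optional

REJECT = "REJECT"


def apply_rules(word: str, rules: list[tuple[str, str]]) -> Optional[str]:
    """Apply rules in file order; return None if the word is rejected."""
    for pattern, replacement in rules:
        if replacement == REJECT:
            if re.search(pattern, word):
                return None
        elif replacement == "!":
            word = re.sub(pattern, "", word)      # remove the match, keep the rest
        else:
            word = re.sub(pattern, replacement, word)
    return word


print(apply_rules("aŋa", [("ŋ", "n"), ("wu", REJECT), ("yi", REJECT)]))  # ana
print(apply_rules("wita", [("w", "y"), ("yi", REJECT)]))                 # None
print(apply_rules("wita", [("yi", REJECT), ("w", "y")]))                 # yita
```

The last two calls use the same rules in a different order: a filter that runs first can create a sequence that a later rejection then catches.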
6.3 Using Regular Expressions
filter: and reject: use ECMAScript regular expressions. If you know what that means, great; but if not, don't worry about it. The important things are:
- ^ matches the beginning of the word. reject: ^a would prevent a word from starting with a.
- $ matches the end of the word. reject: a$ would prevent a word from ending with a.
- (a|b|c) etc match multiple segments. The default phonology definition prevents a word from having a voiceless plosive followed by h by rejecting (p|t|k|ꞌ)h.
If you want to prevent an entire part of a word from appearing twice in a row, you can reject: (..+)\1. This would prevent e.g. kiki from being generated, as it is just ki twice.
If you're confident that it is okay to simplify such occurrences, you may instead filter: (..+)\1+ > $1. This would simplify kiki into simply ki. This may not be desirable as it can make words that are significantly shorter than expected.
If you need to prevent the matching of characters without a combining diacritic to a character with a combining diacritic, you need to use (?=\w|$) after the character. For example filter: o(?=\w|$)x > oy will prevent őx becoming oy.
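The anchor, alternation, and backreference patterns above can be tried out in Python, whose `re` module treats these particular patterns the same way. This is a stand-in for experimentation only: Lexifer itself uses ECMAScript regular expressions, where the replacement is written $1 rather than Python's \1. `is_rejected` is a hypothetical helper.

```python
import re


def is_rejected(pattern: str, word: str) -> bool:
    # A reject: rule fires when the pattern matches anywhere in the word.
    return re.search(pattern, word) is not None


print(is_rejected(r"^a", "ata"))             # True: starts with a
print(is_rejected(r"a$", "kat"))             # False: does not end with a
print(is_rejected(r"(p|t|k|ꞌ)h", "atha"))    # True: t followed by h
print(is_rejected(r"(..+)\1", "kiki"))       # True: ki repeated
print(re.sub(r"(..+)\1+", r"\1", "kiki"))    # ki
print(re.sub(r"(..+)\1+", r"\1", "takaka"))  # taka
```

The last line shows the shortening effect mentioned above: the repeated ak inside takaka collapses, leaving a four-letter word.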
6.4 Cluster fields
Cluster fields are a way to put a lot of related filters or rejections in a smaller space. They are laid out like tables, and start with %. For example, a cluster field could look like:
    % a i u
    a + + o
    i - + uu
    u - - +
The first character is the row, and the second character is the column. In this example, au becomes o and iu becomes uu. + means to leave the combination as-is, and - means to reject it. This table would permit ai but reject ia.
Cluster fields can also use ! in them to remove a sequence.
As with filters, these are parsed in the order presented. The cluster field ends at a blank line.
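A cluster field can be read as shorthand for a list of ordinary filters and rejections. The Python sketch below shows one way to unfold the example table. `parse_cluster_field` is a hypothetical helper, and splitting cells on whitespace is an assumption about the layout.

```python
def parse_cluster_field(lines: list[str]) -> list[tuple[str, str]]:
    """Unfold cluster-field rows into (pattern, replacement) rules in table order."""
    header = lines[0].split()
    if header[0] != "%":
        raise ValueError("a cluster field must start with %")
    columns = header[1:]
    rules = []
    for line in lines[1:]:
        row, *cells = line.split()
        for col, cell in zip(columns, cells):
            if cell == "+":
                continue                              # leave the pair as-is
            if cell == "-":
                rules.append((row + col, "REJECT"))   # forbid the pair
            else:
                rules.append((row + col, cell))       # replacement, or "!" to remove
    return rules


table = [
    "% a i u",
    "a + + o",
    "i - + uu",
    "u - - +",
]
for rule in parse_cluster_field(table):
    print(rule)
```

For the example table this yields au > o, ia > REJECT, iu > uu, ua > REJECT and ui > REJECT, in that order.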