Lexifer documentation

1 About Lexifer

This is the complete documentation for Lexifer version b2.0.1

Lexifer Online is an online application that randomly generates words from a given definition of phonemes, frequencies and word patterns. Applications like Lexifer are called "word generators" or "vocabulary generators".

You can use it to make words for a constructed language, to get an original nickname or password, or just for fun.

2 Interface

Use the Examples dropdown button to load a number of example definitions into the file editor
The phonology definition file editor is the main input. It defines the phonology and the word shapes you get from the word generator. The file editor starts with a default phonology definition, or with the definition you last generated words with
Use the Generate button to see Lexifer produce words
Use the Copy button to copy the words to your clipboard

2.1 Options

Use the Number of words textbox to choose the number of words to generate. The default number is 100
Word-list mode will produce a list of words
Paragraph mode will produce words that look vaguely like sentences by injecting punctuation into the word list and capitalising the first word of each sentence
Debug mode will show, line by line, each step in creating each word
Editor wrap lines will make the file editor wrap a line onto the next row when it exceeds the width of the file editor
Remove duplicates will make sure all words generated are unique
Force word limit will force the generator to try to generate the complete number of words requested within 30 seconds, regardless of how many words are rejected or removed as duplicates
Sort words and Capitalise words should be self-explanatory
The Word divider textbox sets the delimiter, or in other words, what the content will be between each word in the output. It is a space ( ) by default. Use \n to get one word for each line

2.3 File save / load

Use the Save button to download your phonology definition as a file called 'lexifer.txt', or under the name you entered in the File name: field. The file always has the ".txt" extension
Use the Load button to load a file on your system into the file editor

3 Comments

If a line contains a #, everything after it on that line is ignored. You can use this to leave notes about what something does or why you made certain decisions.
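
As a rough illustration of the rule above, here is a minimal Python sketch of comment stripping. The function name strip_comment is hypothetical and this is not Lexifer's own code; it only mirrors the behaviour described here.

```python
def strip_comment(line: str) -> str:
    """Drop the first '#' and everything after it, then trim trailing spaces."""
    return line.split("#", 1)[0].rstrip()


print(strip_comment("V = a i e o u  # five-vowel system"))  # V = a i e o u
```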

4 The with: directive

The first line of the default definition starts with with:. The with: directive defines a featureset and engines.

4.1 Featuresets

If you have a with: statement, you must use exactly one featureset. Currently, there are two options: std-ipa-features and std-digraph-features. The former is IPA, and the latter is ASCII-friendly. The recognised consonants are as follows:

IPA	Digraph	Features
p	p	voiceless bilabial plosive
b	b	voiced bilabial plosive
ɸ	ph	voiceless bilabial fricative
β	bh	voiced bilabial fricative
f	f	voiceless labiodental fricative
v	v	voiced labiodental fricative
m	m	voiced labial¹ nasal
t	t	voiceless alveolar plosive
d	d	voiced alveolar plosive
s	s	voiceless alveolar sibilant
z	z	voiced alveolar sibilant
θ	th	voiceless alveolar² fricative
ð	dh	voiced alveolar² fricative
ɬ	lh	voiceless alveolar lateral fricative
ɮ	ldh	voiced alveolar lateral fricative
tɬ	tl	voiceless alveolar lateral affricate
dɮ	dl	voiced alveolar lateral affricate
ts	ts	voiceless alveolar affricate
dz	dz	voiced alveolar affricate
ʃ	sh	voiceless postalveolar sibilant
ʒ	zh	voiced postalveolar sibilant
tʃ	ch	voiceless postalveolar affricate
dʒ	j	voiced postalveolar affricate
n	n	voiced alveolar nasal
ʈ	rt	voiceless retroflex plosive
ɖ	rd	voiced retroflex plosive
ʂ	sr	voiceless retroflex sibilant
ʐ	zr	voiced retroflex sibilant
ʈʂ	rts	voiceless retroflex affricate
ɖʐ	rdz	voiced retroflex affricate
ɳ	rn	voiced retroflex nasal
c	ky	voiceless palatal plosive
ɟ	gy	voiced palatal plosive
ɕ	sy	voiceless palatal sibilant
ʑ	zy	voiced palatal sibilant
ç	hy	voiceless palatal fricative
ʝ	yy	voiced palatal fricative
tɕ	cy	voiceless palatal affricate
dʑ	jy	voiced palatal affricate
ɲ	ny	voiced palatal nasal
k	k	voiceless velar plosive
g	g	voiced velar plosive
x	kh	voiceless velar fricative
ɣ	gh	voiced velar fricative
ŋ	ng	voiced velar nasal
q	q	voiceless uvular plosive
ɢ	gq	voiced uvular plosive
χ	qh	voiceless uvular fricative
ʁ	gqh	voiced uvular fricative
ɴ	nq	voiced uvular nasal

¹ Labial here means both bilabial and labiodental. For example, the assimilations engine turns nf into mf and nɸ into mɸ, even though f and ɸ have different places of articulation.

² Yes, the IPA describes these as dental. However, the IPA does not make the dental/alveolar distinction elsewhere, so it is simpler to say that these are alveolar.

Choosing a specific featureset does not mean you have to use it for everything. Rather, you only need to use it for the consonants that will be considered by the engines you use (see below). Any unrecognised segments will be ignored.

4.2 Engines

Engines are applied after word generation and before any user-defined filters.

std-assimilations

This engine has two behaviours.

The first affects all consonants for which both voiced and voiceless versions exist. It applies leftward assimilation of voicing. For example, it would turn akda into agda and abta into apta.

The second only changes nasals, but considers all consonants except for approximants, lateral approximants, and trills. It applies leftward assimilation of place of articulation. For example, it would turn amta into anta and anka into aŋka.
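
The two behaviours can be sketched in Python as follows. This is only an illustration under strong assumptions: the feature table is cut down to six plosives and three nasals, every name in it is hypothetical, and the real engine covers the full featureset table above.

```python
# Hypothetical, heavily reduced feature data (std-ipa-features symbols).
TO_VOICED = {"p": "b", "t": "d", "k": "g"}            # voiceless -> voiced
TO_VOICELESS = {v: k for k, v in TO_VOICED.items()}   # voiced -> voiceless
PLACE = {"p": "bilabial", "b": "bilabial",
         "t": "alveolar", "d": "alveolar",
         "k": "velar", "g": "velar"}
NASAL_AT = {"bilabial": "m", "alveolar": "n", "velar": "ŋ"}


def assimilate(word: str) -> str:
    segs = list(word)
    # Walk right to left so each consonant adapts to the one that follows it.
    for i in range(len(segs) - 2, -1, -1):
        cur, nxt = segs[i], segs[i + 1]
        if nxt not in PLACE:
            continue  # the following segment is not a recognised plosive
        if cur in NASAL_AT.values():
            segs[i] = NASAL_AT[PLACE[nxt]]    # place assimilation of nasals
        elif cur in TO_VOICED and nxt in TO_VOICELESS:
            segs[i] = TO_VOICED[cur]          # voiceless before voiced
        elif cur in TO_VOICELESS and nxt in TO_VOICED:
            segs[i] = TO_VOICELESS[cur]       # voiced before voiceless
    return "".join(segs)


for w in ["akda", "abta", "amta", "anka"]:
    print(w, "->", assimilate(w))
# akda -> agda, abta -> apta, amta -> anta, anka -> aŋka
```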

coronal-metathesis

This engine only affects bilabial, alveolar, and velar plosives and nasals. It ensures that clusters of these segments have the alveolar element last. For example, it would turn atka into akta and anma into amna. It does not metathesise a nasal with a plosive; anpa would not become apna.
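
A minimal Python sketch of this rule, assuming single-character IPA segments and clusters of two; the names are hypothetical and the code is not taken from Lexifer itself.

```python
PLOSIVES = {"p", "b", "t", "d", "k", "g"}
NASALS = {"m", "n", "ŋ"}
ALVEOLAR = {"t", "d", "n"}


def coronal_metathesis(word: str) -> str:
    segs = list(word)
    for i in range(len(segs) - 1):
        a, b = segs[i], segs[i + 1]
        # Only plosive+plosive or nasal+nasal pairs are eligible.
        same_manner = {a, b} <= PLOSIVES or {a, b} <= NASALS
        if same_manner and a in ALVEOLAR and b not in ALVEOLAR:
            segs[i], segs[i + 1] = b, a  # move the alveolar element last
    return "".join(segs)


print(coronal_metathesis("atka"))  # akta
print(coronal_metathesis("anma"))  # amna
print(coronal_metathesis("anpa"))  # anpa (nasal + plosive is left alone)
```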

5 On defining frequency, phonology, and word creation

This section covers the core of the word generator: how words are initially generated, before any filters or rejections modify them.

5.1 Alphabetisation – the letters: directive

If you have a with: directive, there must also be letters:. If not, letters: is optional. letters: tells Lexifer what symbols you use and how to alphabetise them. It also affects how digraphs are parsed, even if std-ipa-features was chosen. For example, consider the following statements:

with: std-ipa-features
letters: t ʃ

In this case, if tʃ occurs, it will not be treated as an affricate tʃ, but as a plosive t followed by a sibilant ʃ. Additionally, words starting with t will be sorted alphabetically above words starting with ʃ. Contrast this with the following statements:

with: std-ipa-features
letters: tʃ t ʃ

In this case, tʃ is treated as an affricate. Additionally, words starting with tʃ will be sorted above words starting with tt, even though t by itself comes before ʃ.
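
To make the two cases concrete, here is a small Python sketch of how a letters: list could drive both parsing and sorting. It assumes longest-match-first parsing, which is a guess that happens to reproduce the behaviour described above; tokenize and sort_key are hypothetical names.

```python
def tokenize(word: str, letters: list[str]) -> list[str]:
    """Split a word into letters, preferring the longest declared letter."""
    by_length = sorted(letters, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for letter in by_length:
            if word.startswith(letter, i):
                out.append(letter)
                i += len(letter)
                break
        else:
            out.append(word[i])  # undeclared symbol: keep it as one character
            i += 1
    return out


def sort_key(word: str, letters: list[str]) -> list[int]:
    """Rank each letter by its position in letters; unknown symbols sort last."""
    rank = {letter: n for n, letter in enumerate(letters)}
    return [rank.get(tok, len(letters)) for tok in tokenize(word, letters)]


print(tokenize("tʃa", ["t", "ʃ"]))        # ['t', 'ʃ', 'a']
print(tokenize("tʃa", ["tʃ", "t", "ʃ"]))  # ['tʃ', 'a']
letters = ["tʃ", "t", "ʃ"]
print(sorted(["tta", "tʃa"], key=lambda w: sort_key(w, letters)))
# ['tʃa', 'tta']
```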

5.2 Phoneme classes

These are groupings of phonemes that have one-letter names. For example, here are the classes from the default definition:

C = t n k m ch l ꞌ s r d h w b y p g
D = n l ꞌ t k r p
V = a i e á u o

This creates three groupings. C is the group of all consonants, V is the group of all vowels, and D is a group of some of the consonants. A class cannot contain another class; this is not legal:

C = D m ch s d h w b y g

If you do this, and you have a letters: directive, Lexifer will warn you:

A phoneme class contains 'D' missing from 'letters'. Strange word shapes are likely to result.

By default, the phonemes' frequencies decrease as they go to the right, according to the Gusein-Zade distribution. In the above example, when Lexifer needs to choose a C, it will choose t the most, n the second-most, k the third-most, and so on. If you are not satisfied with the frequencies, you can use a colon (:) to specify the weight for each phoneme, like so:

V = a e i o u
# V has approximately the following probabilities:
# a: 43%, e: 26%, i: 17%, o: 10%, u: 4%
U = a:5 e:4 i:3 o:2 u:1
# U has approximately the following probabilities:
# a: 33%, e: 27%, i: 20%, o: 13%, u: 7%

Weights are relative, so a:5 e:4 i:3 o:2 u:1 is the same as a:50 e:40 i:30 o:20 u:10. Changing the order or weights of phonemes is a good way to change the feel of the language without changing the phonotactics.
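
The percentages quoted for V and U can be reproduced with a few lines of Python. The sketch assumes the Gusein-Zade weight for rank r out of n phonemes is ln(n + 1) - ln(r), normalised so the weights sum to 1; that assumption matches the figures above, but the function names are hypothetical and the code is not Lexifer's own.

```python
import math


def gusein_zade(n: int) -> list[float]:
    """Default probabilities for a class of n phonemes, most frequent first."""
    raw = [math.log(n + 1) - math.log(rank) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def normalise(weights: list[float]) -> list[float]:
    """Turn explicit relative weights into probabilities."""
    total = sum(weights)
    return [w / total for w in weights]


print([round(100 * p) for p in gusein_zade(5)])               # [43, 26, 17, 10, 4]
print([round(100 * p) for p in normalise([5, 4, 3, 2, 1])])   # [33, 27, 20, 13, 7]
```

Because only the ratios matter, normalise([50, 40, 30, 20, 10]) gives the same probabilities as normalise([5, 4, 3, 2, 1]).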

If you specify a weight for any phoneme in a class, you must specify the weight for all of them. If you specify a weight of 0, the phoneme will never be selected.

Weights can be fractions, for example: C = t:2.5 k:1 n:0.75

5.3 Macros

Macros are a system designed to provide an abbreviation for syllable shapes. They are defined similarly to phoneme classes, but with several important differences:

Every macro's name starts with $. S = s is a phoneme class; $S = s is a macro.
Macros allow phoneme classes inside of them. C = D is not valid, but $C = D works as expected.
Macros do not support multiple possibilities. $M = a b c will not work the way you may think.

The default definition has one macro:

$S = CVD?
words: V?$S$S V?$S V?$S$S$S

This is exactly equivalent to the following definition:

words: V?CVD?CVD? V?CVD? V?CVD?CVD?CVD?

However, since most syllables are CVD?, it is quicker to use a macro.
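
The equivalence can be checked with plain text substitution. This Python sketch treats a macro as a literal find-and-replace, which is an assumption about how expansion works; expand_macros is a hypothetical name.

```python
def expand_macros(pattern: str, macros: dict[str, str]) -> str:
    """Replace every macro name in a word shape with its body."""
    for name, body in macros.items():
        pattern = pattern.replace(name, body)
    return pattern


macros = {"$S": "CVD?"}
shapes = "V?$S$S V?$S V?$S$S$S".split()
print(" ".join(expand_macros(s, macros) for s in shapes))
# V?CVD?CVD? V?CVD? V?CVD?CVD?CVD?
```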

5.4 The random-rate: directive

The random-rate: directive specifies how often optional phonemes or classes are selected. This number is a percentage. For example,

random-rate: 25
words: CVD?

is equivalent to

words: CV:75 CVD:25

The default random-rate is 10%.
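
The equivalence above can be worked out mechanically. The Python sketch below expands a shape with optional elements into explicit weighted shapes. It assumes every symbol is a single character and that each optional element is decided independently; neither point is confirmed by this documentation, and expand_optionals is a hypothetical name.

```python
from itertools import product


def expand_optionals(pattern: str, rate: float) -> dict[str, float]:
    """Map each concrete shape to its weight (in percent) for a given rate."""
    parts = []  # (symbol, is_optional) pairs
    for ch in pattern:
        if ch == "?":
            parts[-1] = (parts[-1][0], True)  # '?' marks the symbol before it
        else:
            parts.append((ch, False))
    optional_idx = [i for i, (_, opt) in enumerate(parts) if opt]
    result: dict[str, float] = {}
    for choice in product([False, True], repeat=len(optional_idx)):
        keep = dict(zip(optional_idx, choice))
        shape = "".join(s for i, (s, _) in enumerate(parts) if keep.get(i, True))
        weight = 100.0
        for included in choice:
            weight *= rate / 100 if included else 1 - rate / 100
        result[shape] = result.get(shape, 0.0) + weight
    return result


print(expand_optionals("CVD?", 25))  # {'CV': 75.0, 'CVD': 25.0}
```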

5.5 Building words

The most common way to make a word is to use the words: directive. Words are weighted similarly to how phonemes are weighted in classes.

A word can consist of individual phonemes, phoneme classes, or a mixture of both.

Phonemes or classes that are optional can be indicated by a ?. For example, words: CVD? is similar to words: CV CVD, although the weights are quite different.

If you choose from the same class twice in a row, you may put an ! after the second one, to indicate they must not be the same phoneme. For example, CC may generate tt, but CC! never will.

By default, words are selected using the Zipf distribution.
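
Here is a simplified Python sketch of how one word shape could be turned into a word, covering ? and !. It is an illustration only: the classes are example data, phonemes are picked uniformly rather than by weight, symbols are single characters, and build_word is a hypothetical name.

```python
import random

CLASSES = {"C": ["t", "n", "k"], "V": ["a", "i"]}  # example data, unweighted


def build_word(pattern: str, rng: random.Random, rate: float = 10) -> str:
    out = []
    i = 0
    while i < len(pattern):
        sym = pattern[i]
        nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
        choices = CLASSES.get(sym, [sym])  # a bare phoneme stands for itself
        if nxt == "!" and out:
            # '!' forbids repeating the phoneme that was just emitted
            choices = [c for c in choices if c != out[-1]]
        if nxt == "?" and rng.random() >= rate / 100:
            pass  # optional element skipped
        else:
            out.append(rng.choice(choices))
        i += 2 if nxt in ("?", "!") else 1
    return "".join(out)


rng = random.Random(0)
print([build_word("CVC?", rng, rate=50) for _ in range(5)])
print([build_word("CC!", rng) for _ in range(5)])  # never tt, nn or kk
```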

5.6 Categories

The categories: directive is an alternative to words:. You may not include both directives in the same definition.

categories: lets you define multiple types of words. The general syntax is:

categories: cat1 cat2 # ...etc
cat1 = # word shapes for cat1
cat2 = # word shapes for cat2

The categories themselves can also be weighted, but these weights only apply in paragraph mode. If you give a number of words, that is the number of words generated per category. This is where a weight of 0 could be helpful. If you want to generate parts of a word when you enter a number, but only show complete words in paragraph mode, you could have something like:

categories: root:0 prefix:0 suffix:0 full-word:1
# ...definitions of each category...

The order that the categories are declared is the order they are presented when generating a specific number of words.

6 Filters and rejections

Filters and rejections modify or remove words generated from the words: or categories: directive. They are executed in the order they are written.

6.1 Filters

Filters are a way to change words after they have been generated and run through the engines in the with: directive. If your spelling doesn't match up with a featureset exactly, you can use filters to achieve this.

Filters are expressed as filter: pattern > replacement. For example, if you want to spell [ŋ] the same as [n], you would say:

filter: ŋ > n

Multiple filters on one line are separated by semicolons:

filter: pattern1 > replacement1; pattern2 > replacement2

This does not mean that the two filters are run at the same time. It is identical to:

filter: pattern1 > replacement1
filter: pattern2 > replacement2

If the replacement is !, the pattern is removed from the word, but the rest of the word is left alone.

6.2 Rejections

To outright forbid a sequence from occurring, use the reject: directive. The default definition contains a few of these. The first two are:

reject: wu yi

This prevents any word from having wu or yi. In reality, reject: is an abbreviation, and that statement is equivalent to:

filter: wu > REJECT; yi > REJECT

As such, you can intersperse filters and rejections, and they will be performed in order.
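
The ordering can be pictured as a single list of rules applied one after another. The Python sketch below uses Python's re module in place of ECMAScript regular expressions (the simple patterns here behave the same in both); apply_rules and the h > ! rule are hypothetical examples.

```python
import re


def apply_rules(word: str, rules: list[tuple[str, str]]):
    """Apply filters and rejections in order; return None if rejected."""
    for pattern, replacement in rules:
        if replacement == "REJECT":
            if re.search(pattern, word):
                return None                       # the whole word is discarded
        elif replacement == "!":
            word = re.sub(pattern, "", word)      # remove the match only
        else:
            word = re.sub(pattern, replacement, word)
    return word


rules = [("ŋ", "n"), ("wu", "REJECT"), ("yi", "REJECT"), ("h", "!")]
print(apply_rules("aŋka", rules))  # anka
print(apply_rules("wuha", rules))  # None
print(apply_rules("taha", rules))  # taa
```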

6.3 Using Regular Expressions

filter: and reject: use ECMAScript regular expressions. If you know what that means, great; but if not, don't worry about it. The important things are:

^ matches the beginning of the word. reject: ^a would prevent a word from starting with a.
$ matches the end of the word. reject: a$ would prevent a word from ending with a.
(a|b|c) matches any one of the listed alternatives. The default phonology definition prevents a word from having a voiceless plosive followed by h by rejecting (p|t|k|ꞌ)h.

If you want to prevent an entire part of a word from appearing twice in a row, you can reject: (..+)\1. This would prevent e.g. kiki from being generated, as it is just ki twice.

If you're confident that it is okay to simplify such occurrences, you may instead filter: (..+)\1+ > $1. This would simplify kiki into simply ki. This may not be desirable as it can make words that are significantly shorter than expected.
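
You can try both patterns outside Lexifer. The sketch below uses Python's re module, where the replacement is written \1 instead of the $1 used in Lexifer filters; the patterns themselves are unchanged.

```python
import re

repeated = re.compile(r"(..+)\1")
print(bool(repeated.search("kiki")))  # True: would be rejected
print(bool(repeated.search("kika")))  # False: would be kept

# Equivalent of filter: (..+)\1+ > $1
print(re.sub(r"(..+)\1+", r"\1", "kikiki"))  # ki
print(re.sub(r"(..+)\1+", r"\1", "takiki"))  # taki
```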

If you need to stop a pattern written for a character without a combining diacritic from also matching the same character with a combining diacritic, use (?=\w|$) after the character. For example, filter: o(?=\w|$)x > oy will prevent őx from becoming oy.

6.4 Cluster fields

Cluster fields are a way to put a lot of related filters or rejections in a smaller space. They are laid out like tables, and start with %. For example, a cluster field could look like:

% a  i  u
a +  +  o
i -  +  uu
u -  -  +

In each two-segment sequence, the first segment is the row label and the second is the column label. In this example, au becomes o and iu becomes uu. + means to leave the combination as-is, and - means to reject it. This table would permit ai but reject ia.

Cluster fields can also use ! in them to remove a sequence.

As with filters, these are parsed in the order presented. The cluster field ends at a blank line.
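
One way to read a cluster field is as shorthand for a list of ordinary filters and rejections. The Python sketch below performs that expansion for the example table, assuming cells are separated by whitespace; parse_cluster_field is a hypothetical name and the code is not Lexifer's own parser.

```python
def parse_cluster_field(lines: list[str]) -> list[tuple[str, str]]:
    """Expand a cluster field into (pattern, replacement) pairs, row by row."""
    header = lines[0].split()
    if header[0] != "%":
        raise ValueError("a cluster field must start with %")
    columns = header[1:]
    rules = []
    for line in lines[1:]:
        row, *cells = line.split()
        for col, cell in zip(columns, cells):
            if cell == "+":
                continue  # leave this combination as-is
            # '-' rejects; anything else ('!' included) is the replacement
            rules.append((row + col, "REJECT" if cell == "-" else cell))
    return rules


table = [
    "% a  i  u",
    "a +  +  o",
    "i -  +  uu",
    "u -  -  +",
]
for rule in parse_cluster_field(table):
    print(rule)
# ('au', 'o'), ('ia', 'REJECT'), ('iu', 'uu'), ('ua', 'REJECT'), ('ui', 'REJECT')
```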