UPSID Info

This site is a (hopefully) simple user interface to the UCLA Phonological Segment Inventory Database (UPSID).

This database was compiled by Ian Maddieson and Kristin Precoda (cf. Maddieson, 1984) and contains information on the distribution of 919 different segments in 451 languages. Henning Reetz took the original data from a ZIP file and added the HTML interface you see. Only a few typographical changes have been made.



Some terms used:

Segment frequency:
This is the number of languages that contain a specific segment divided by the number of languages in UPSID, expressed in percent. For example, a segment that is found in only one language has a frequency of (1 / 451) * 100 = 0.22 (in other words, it exists in about 0.2% of all languages in UPSID). The most frequent segment in UPSID is the bilabial nasal /m/, which occurs in 425 languages; its segment frequency is therefore 94.2%. There are 919 different segments in the database, so the complete list of all frequencies is rather long. The 20 most frequent consonants and the 10 most frequent vowels are:

consonant:         m     k     j     p     w     b     h     g     N     ?     n     s    tS     S     t     f     l    "n    "t    nj
in languages:    425   403   378   375   332   287   279   253   237   216   202   196   188   187   181   180   174   160   152   141
frequency:      94.2  89.4  83.8  83.2  73.6  63.6  61.9  56.1  52.6  47.9  44.8  43.5  41.7  41.5  40.1  39.9  38.6  35.5  33.7  31.3

vowel:             i     a     u     E    "o    "e     O     o     e    a~
in languages:    393   392   369   186   181   169   162   131   124    83
frequency:      87.1  86.9  81.8  41.2  40.1  37.5  35.9  29.0  27.5  18.4
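
As a small illustration of the definition above, the frequency column can be reproduced from the "in languages" row. This is only a sketch: the function name `segment_frequency` is made up for this example and is not part of UPSID or this site.

```python
N_LANGUAGES = 451  # number of languages in UPSID, as stated above

def segment_frequency(n_languages_with_segment, n_languages_total=N_LANGUAGES):
    """Return the segment frequency in percent."""
    return 100.0 * n_languages_with_segment / n_languages_total

# A segment found in a single language, and the bilabial nasal /m/ (425 languages).
print(round(segment_frequency(1), 2))    # 0.22
print(round(segment_frequency(425), 1))  # 94.2
```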


At the other end of the scale there are many segments that occur in only one or a few languages:

Number of segments:     427    117     66     39     27     19     14     14     12     13
that occur only in:       1      2      3      4      5      6      7      8      9     10  languages
% of all segments:    46.46  12.73   7.18   4.24   2.94   2.07   1.52   1.52   1.31   1.41
cumulative %:         46.46  59.19  66.38  70.62  73.56  75.63  77.15  78.67  79.98  81.39

That is, the sounds that appear in 10 or fewer of the 451 languages make up more than 80% of the 919 sounds in the database.
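
The cumulative percentages can be checked from the segment counts alone. The sketch below copies the counts from the table; the variable names are chosen for this example only.

```python
from itertools import accumulate

TOTAL_SEGMENTS = 919  # number of different segments in UPSID

# Number of segments that occur in exactly 1, 2, ..., 10 languages (from the table).
counts = [427, 117, 66, 39, 27, 19, 14, 14, 12, 13]

# Running totals, then each running total as a percentage of all segments.
cumulative_counts = list(accumulate(counts))
cumulative_percent = [round(100.0 * c / TOTAL_SEGMENTS, 2) for c in cumulative_counts]

print(cumulative_counts[-1])   # 748 segments occur in 10 or fewer languages
print(cumulative_percent[-1])  # 81.39
```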


Number of segments in a language:
This is simply the number of segments that are in a language according to the UPSID database.
The histogram below shows the distribution of the number of segments across the 451 languages in UPSID.

[Figure: histogram of the number of segments per language (nr_seg)]

     min    2.5%     10%     25%  median    mean     75%     90%   97.5%    max.
      11      16      20      23      29   30.97      36      43      58     141

There are actually two languages with 11 and one with 141 segments, as can be seen in the respective list.


Frequency index:
This number is the arithmetic average of the segment frequencies of a language, with each frequency expressed as a proportion between 0 and 1 rather than in percent. A language with mostly rare segments will have a low frequency index, whereas a language with mostly common sounds will have a high frequency index. A frequency index of 0.1 means that a language has many very rare segments; 0.7 means it has many common segments; the average frequency index of all languages is 0.39.
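
A minimal sketch of this calculation follows. It assumes the segment frequencies are taken as proportions, which the 0.1 to 0.7 range quoted above suggests. The function name `frequency_index` and the four-segment toy inventory are hypothetical and only serve as illustration; no real UPSID language is this small.

```python
N_LANGUAGES = 451  # number of languages in UPSID

def frequency_index(language_counts, n_languages_total=N_LANGUAGES):
    """Average the segment frequencies (as proportions) of one inventory.

    language_counts holds, for each segment of the language, the number of
    UPSID languages that contain that segment.
    """
    frequencies = [count / n_languages_total for count in language_counts]
    return sum(frequencies) / len(frequencies)

# Hypothetical toy inventory: /m/ (425), /k/ (403), /i/ (393),
# plus one segment that occurs in a single language only.
toy_inventory = [425, 403, 393, 1]
print(round(frequency_index(toy_inventory), 3))  # 0.677
```

Replacing the common segments with rare ones pulls the index down, which is the behaviour described above.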

The histogram below shows the distribution of the frequency indices in UPSID.

[Figure: histogram of the frequency indices (frq_ind)]

     min    2.5%     10%     25%  median    mean     75%     90%   97.5%    max.
   .1057   .2044   .2663   .3300   .3891   .3909   .4520   .5147   .5785   .6562


Note that there is a relation between the frequency index and the number of segments in a language. If a language has only a few segments, these are likely to be rather common across the languages in UPSID. A language with many segments, on the other hand, will also have many segments that are uncommon in UPSID. This does not necessarily mean that certain sounds are more natural; it is a probabilistic effect. If you fill a pot with many red marbles, a few green marbles, and other marbles of various colors, and draw a small random sample (e.g. 10 marbles), you will get mostly red marbles. If you draw a large random sample (e.g. 100 marbles), you will also get many colors that are represented by only a single marble.
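
The sampling effect can be imitated with a short simulation. This is an illustration under stated assumptions, not the method used to build UPSID: an "inventory" is modelled as a set of distinct types drawn with probability proportional to a made-up commonness weight (1/rank for 200 hypothetical types), and all names in the code are invented for this example.

```python
import random

def draw_distinct(weights, k, rng):
    """Draw k distinct type indices, each pick proportional to its weight."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
        remaining.remove(pick)  # a type can enter an inventory only once
        chosen.append(pick)
    return chosen

def mean_commonness(weights, k, trials, rng):
    """Average weight of the drawn types, averaged over several trials."""
    total = 0.0
    for _ in range(trials):
        sample = draw_distinct(weights, k, rng)
        total += sum(weights[i] for i in sample) / k
    return total / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [1.0 / rank for rank in range(1, 201)]  # hypothetical skewed commonness

small = mean_commonness(weights, k=10, trials=200, rng=rng)
large = mean_commonness(weights, k=100, trials=200, rng=rng)
print(small > large)  # small samples consist mostly of common types
```

Under these assumptions the small samples have a higher average commonness than the large ones, without any type being "more natural" than another.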

The scatterplot below shows the relation between the frequency index and the logarithm of the number of segments in a language. The formula of the fitted curve is "Freq. = 1.2282298 - 0.2479315 Log(nr_seg)", with an RSquare of 0.718 for the fit.

[Figure: scatterplot of frequency index against the logarithm of the number of segments (freq_vs_nrseg)]
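
The fitted curve can be evaluated for a given inventory size as sketched below. The page does not state the base of "Log"; the natural logarithm is assumed here because it yields about 0.39 for the median inventory size of 29, close to the reported median frequency index of .3891. The function name is chosen for this example.

```python
import math

INTERCEPT = 1.2282298  # coefficients quoted in the formula above
SLOPE = -0.2479315

def predicted_frequency_index(nr_seg):
    """Evaluate the fitted curve; natural log is an assumption, see above."""
    return INTERCEPT + SLOPE * math.log(nr_seg)

# Smallest, median, and largest inventory sizes reported for UPSID.
for nr_seg in (11, 29, 141):
    print(nr_seg, round(predicted_frequency_index(nr_seg), 3))
```

With an RSquare of 0.718 the curve describes only the overall trend, so individual languages can deviate noticeably from the predicted value.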

