How to Build a Global Tree of All Known Languages? - A Brief Demonstration.
By Julien d'Huy
[I](https://unive-paris1.academia.edu/JuliendHuy) used [Dediu's binary coding (with MrBayes)](http://rspb.royalsocietypublishing.org/content/royprsb/suppl/2010/08/27/rspb.2010.1595.DC1/rspb20101595supp1.pdf) of the [WALS dataset](http://wals.info/), together with the software [Structure 2.3.4](http://pritchardlab.stanford.edu/structure.html) (fig. 1, see supplementary material) and [StructureHarvester](http://taylor0.biology.ucla.edu/structureHarvester/) (fig. 2), to estimate the most likely number (“K”) of founding populations (here, K = 20) needed to explain the structure of the corpus. For each cultural area, the Fst score ranges from 42% to 88%, [which is much higher than the genetic variation among human populations](http://www.pnas.org/content/106/42/17671.short), and suggests that language families, once established, have been very conservative. I then excluded from the corpus every language in which more than 30% of the data reflected an acculturation signal (i.e. less than 70% single-origin). Because these languages have a complex history, they are expected to be intermediate between many others and to “blur” the phylogenetic signal. Using the filtered corpus, I built many trees and networks (figs. 3-5), which correctly group, for the first time, all languages into known language families and show evidence for some higher-level clusters. The Austronesian language family may not form a monophyletic group.
This approach, which has yet to be tested on larger datasets, may be a major step toward a complete understanding of the historical relationships among the world's languages and toward answering many other important questions, such as the reconstruction of the structure of the human proto-language and its evolution.
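The filtering step described above (dropping every language with less than 70% single-origin membership) can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually run for this study: it assumes the Structure output has already been parsed into a mapping from language name to its K membership proportions, and the function name `filter_single_origin`, the language names, and the values are all hypothetical. K = 3 is used for brevity, whereas the analysis above uses K = 20.

```python
from typing import Dict, List


def filter_single_origin(
    memberships: Dict[str, List[float]], threshold: float = 0.70
) -> Dict[str, List[float]]:
    """Keep languages whose largest cluster membership is >= threshold.

    `memberships` maps a language name to its estimated membership
    proportions across the K inferred populations (each row sums to ~1).
    A language is dropped when no single population accounts for at
    least `threshold` of its data, i.e. it is treated as too admixed.
    """
    return {
        name: props
        for name, props in memberships.items()
        if max(props) >= threshold
    }


# Hypothetical membership proportions (made-up values, K = 3).
q_matrix = {
    "lang_a": [0.92, 0.05, 0.03],  # mostly one origin: kept
    "lang_b": [0.40, 0.35, 0.25],  # strongly mixed: dropped
    "lang_c": [0.10, 0.75, 0.15],  # mostly one origin: kept
}

kept = filter_single_origin(q_matrix)
print(sorted(kept))  # ['lang_a', 'lang_c']
```

The threshold is left as a parameter so that the effect of a stricter or looser cut-off on the resulting trees can be checked.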
Attachment: Electronic_supplementary_material.pdf (286 KB)