machine learning

Statistical unicodification of African languages

Submitted by Guy on Tue, 2011-09-20 10:36

Statistical unicodification of African languages, Scannell, Kevin P., Language Resources and Evaluation, 09/2011, Volume 45, Issue 3, p.375-386, (2011)


Automatic Diacritic Restoration for African Languages

Submitted by Guy on Tue, 2007-10-23 12:16

The orthography of many African languages includes diacritically marked characters. Falling outside the scope of the standard Latin encoding, these characters are often represented in digital language resources as their unmarked equivalents. This renders corpus compilation more difficult, as these languages typically do not have the benefit of large electronic dictionaries to perform diacritic restoration.

This is a demonstration system for a diacritic restoration method that automatically restores diacritics on the basis of local graphemic context. It is based on the machine learning method of memory-based learning. We have applied the method to the African languages Cilubà, Gĩkũyũ, Kĩkamba, Maa, Sesotho sa Leboa, Tshivenḓa and Yoruba.
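As a rough illustration of the idea (not the system's actual implementation, which is described in the paper), the Python sketch below stores one instance per character: a window of surrounding unmarked characters paired with the marked character seen in training. A new character is then labelled by its most similar stored instances under a simple overlap count. The function names, the window width of 2, and the two-word training list (taken from the language names above) are assumptions made for this example.

```python
import unicodedata
from collections import Counter

PAD = "_"  # padding symbol for positions outside the word


def strip_diacritics(text):
    """Return text with combining marks removed (e.g. ĩ -> i)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def windows(word, width=2):
    """One feature tuple per character: the character plus `width` neighbours on each side."""
    padded = PAD * width + word + PAD * width
    return [tuple(padded[i:i + 2 * width + 1]) for i in range(len(word))]


def train(marked_words, width=2):
    """Build the instance memory from correctly marked words."""
    memory = []
    for marked in marked_words:
        marked = unicodedata.normalize("NFC", marked.lower())
        plain = strip_diacritics(marked)
        if len(plain) != len(marked):
            continue  # skip words without a one-to-one character alignment
        for features, target in zip(windows(plain, width), marked):
            memory.append((features, target))
    return memory


def restore(word, memory, width=2):
    """Label each character with the majority class of its nearest stored instances."""
    restored = []
    for features in windows(word.lower(), width):
        focus = features[width]
        # Only compare against instances with the same unmarked focus character.
        candidates = [
            (sum(a == b for a, b in zip(features, stored)), target)
            for stored, target in memory
            if stored[width] == focus
        ]
        if not candidates:
            restored.append(focus)  # unseen character: copy it unchanged
            continue
        best = max(score for score, _ in candidates)
        votes = Counter(target for score, target in candidates if score == best)
        restored.append(votes.most_common(1)[0][0])
    return "".join(restored)


# Toy memory built from two words that appear in the text above.
memory = train(["Gĩkũyũ", "Kĩkamba"])
print(restore("gikuyu", memory))  # gĩkũyũ
```

With only two training words, the memory has never seen an unmarked i or u, so this toy model marks every one of them. A realistic memory would be built from a sizeable corpus of correctly marked text. Output is lowercased for simplicity.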

You can find more information on this system in this paper

Select a language and enter the word or sentence you want to restore diacritics for.
Cilubà (e.g. mutekete)
Gĩkũyũ (e.g. nituronire)
Kĩkamba (e.g. ningulilikana)
Maasai (e.g. oltunani)
Sesotho sa Leboa (Northern Sotho) (e.g. swanetse)
Tshivenḓa (e.g. tshiswitulo)
Yoruba (e.g. isinku)


Authors:

Guy De Pauw: CNTS - Language Technology Group, University of Antwerp, Antwerp, Belgium, guy [dot] depauw [at] ua [dot] ac [dot] be
Gilles-Maurice de Schryver: African Languages and Cultures, Ghent University, Ghent, Belgium, gillesmaurice [dot] deschryver [at] ugent [dot] be
Peter Waiganjo Wagacha: School of Computing and Informatics, University of Nairobi, Nairobi, Kenya, waiganjo [at] uonbi [dot] ac [dot] ke


Northern Sotho Part-of-Speech Tagger (V2) - Demo

Submitted by Guy on Thu, 2007-10-11 13:05

This demo showcases a part-of-speech tagger for Northern Sotho. It retrieves the morpho-syntactic categories for words in a sentence. It uses MBT, the memory-based tagger, trained on a relatively small annotated corpus.

Version 1: October 10, 2007 (20k tokens training set)
Version 2: December 8, 2007 (35k tokens training set)

Type in the text you want to tag (2,500 character limit)
Example: Motho ge a sa tseba o swanetše go dumela seo gore bao ba tsebago ba mmotše.

Authors:

Guy De Pauw: CNTS - Language Technology Group, University of Antwerp, Antwerp, Belgium, guy [dot] depauw [at] ua [dot] ac [dot] be
Gilles-Maurice de Schryver: African Languages and Cultures, Ghent University, Ghent, Belgium, gillesmaurice [dot] deschryver [at] ugent [dot] be


CNTS - Language Technology Group

Submitted by Guy on Tue, 2006-12-12 15:41

Description:

CNTS is a research center of the Department of Linguistics of the University of Antwerp (UA) in Antwerp, Belgium, engaged in research in computational linguistics and psycholinguistics. The CNTS - Language Technology Group has a strong tradition in the application of machine learning techniques for natural language processing. Recently, CNTS has also started investigating the applicability of unsupervised learning methods and knowledge transfer techniques for the annotation and linguistic description of African languages, particularly Kiswahili and the local languages of Kenya.

URL:

https://www.cnts.ua.ac.be

AfLaT users:

Guy


Kiswahili Part-of-Speech Tagger - Demo

Submitted by Guy on Tue, 2006-12-12 14:58

This demo showcases a broad-coverage part-of-speech tagger for Kiswahili. It retrieves the morpho-syntactic categories for words in a sentence. The system uses the Memory-Based Tagger trained on the Helsinki Corpus of Swahili.

Type in the text you want to tag
Example: Hapo ni kwa nini Sahara halina maji na kwa nini simba na shungi.

Authors:

Guy De Pauw: CNTS - Language Technology Group, University of Antwerp, Antwerp, Belgium, guy [dot] depauw [at] ua [dot] ac [dot] be
Gilles-Maurice de Schryver: African Languages and Cultures, Ghent University, Ghent, Belgium, gillesmaurice [dot] deschryver [at] ugent [dot] be
Peter Waiganjo Wagacha: School of Computing and Informatics, University of Nairobi, Nairobi, Kenya, waiganjo [at] uonbi [dot] ac [dot] ke


Gĩkũyũ Diacritic Placement - Demo

Submitted by Guy on Tue, 2006-12-12 14:52

The orthography of Gĩkũyũ includes a number of accented characters to represent the entire vowel system (namely ĩ and ũ). Not available on standard computer keyboards, these characters are usually typed as the nearest available characters (i and u).
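A small Python sketch can show why the typed forms are ambiguous: every i or u in keyboard input may stand for either the plain or the marked vowel, so a word with n such letters has 2^n possible spellings. The mapping follows the two characters named above; the function name `candidate_spellings` is hypothetical, and the sketch handles lowercase input only.

```python
from itertools import product

# Each typed vowel may stand for itself or for its marked counterpart.
VARIANTS = {
    "i": ("i", "\u0129"),  # i or ĩ
    "u": ("u", "\u0169"),  # u or ũ
}


def candidate_spellings(typed):
    """List every spelling the typed (unmarked) word could represent."""
    options = [VARIANTS.get(ch, (ch,)) for ch in typed]
    return ["".join(choice) for choice in product(*options)]


candidates = candidate_spellings("gikuyu")
print(len(candidates))  # 8
```

The typed form "gikuyu" contains one i and two u's, giving eight candidates, of which "gĩkũyũ" is one. Diacritic placement amounts to choosing the right candidate for each word.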


A grapheme-based approach to accent restoration in Gĩkũyũ

Submitted by Guy on Tue, 2006-12-12 14:41

A grapheme-based approach to accent restoration in Gĩkũyũ, Wagacha, Peter W., De Pauw Guy, and Githinji P. W., Proceedings of the Fifth International Conference on Language Resources and Evaluation, May, 2006, Genoa, Italy, p.1937-1940, (2006)

Data-driven part-of-speech tagging of Kiswahili

Submitted by Guy on Tue, 2006-12-12 14:41

Data-driven part-of-speech tagging of Kiswahili, De Pauw, Guy, de Schryver Gilles-Maurice, and Wagacha Peter W., Proceedings of Text, Speech and Dialogue, 9th International Conference, Volume 4188/2006, Berlin, Germany, p.197-204, (2006)

Development of a corpus for Gĩkũyũ using machine learning techniques

Submitted by Guy on Tue, 2006-12-12 14:41

Development of a corpus for Gĩkũyũ using machine learning techniques, Wagacha, Peter W., De Pauw Guy, and Getao K., Proceedings of LREC workshop - Networking the development of language resources for African languages, Genoa, Italy, (2006)
