Language Technologies Unit Blog, Canolfan Bedwyr

Cronfa Electroneg o Gymraeg (CEG)

A 1 million word lexical database and frequency count for Welsh

Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001)

Brief Summary
Background
File formats and Character coding conventions
Description of the text files
The Raw and Tagged Datafiles
Data quality
Counts of Raw Word Forms
Lemma Counts with analyses of inflections and mutations
Download Word Form files
Contact Information
Use of these Materials

Brief Summary

This is a word frequency analysis of 1,079,032 words of written Welsh prose, based on 500 samples of approximately 2000 words each, selected from a representative range of text types to illustrate modern (mainly post-1970) Welsh prose writing. It was conceived as a Welsh parallel to the Kucera and Francis analysis for American English and the LOB corpus for British English, in the expectation that such an analysed corpus would provide research tools for a number of academic disciplines: psychology and psycholinguistics, child and second language acquisition, general linguistics, and the linguistics of Modern Welsh, including literary analysis.

The sample included materials from the fields of novels and short stories, religious writing, children's literature both factual and fiction, non-fiction materials in the fields of education, science, business, leisure activities, etc., public lectures, newspapers and magazines, both national and local, reminiscences, academic writing, and general administrative materials (letters, reports, minutes of meetings).

The resultant corpus was analysed to produce frequency counts of words both in their raw form and as counts of lemmas where each token is demutated and tagged to its root. This analysis also derives basic information concerning the frequencies of different word classes, inflections, mutations, and other grammatical features.

Articles based on the use of the database should cite:

Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh. [On-line]

Available: www.bangor.ac.uk/canolfanbedwyr/ceg.php.en

Background

This project was funded for the academic year 1993-94 by a grant of £21K from the Higher Education Funding Council for Wales to Ellis, O'Dochartaigh & Hicks of the Welsh IT Unit and the School of Psychology, University of Wales, Bangor. The researchers began work on the project in October 1993, and after the sample range had been identified in collaboration with Professor Gwyn Thomas of the Department of Welsh, proceeded to collect the required range of texts. The original intention was that this range of materials would be acquired in an electronic form from Welsh language publishers and other bodies, such as local authorities, governmental organizations, and papurau bro (locally produced newspapers). However, it proved to be impossible to collect the necessary breadth of materials in an electronic form, primarily because at that time Welsh language publishers did not generally keep computer-based archive copies of books which they may have published using electronic means.

Under these circumstances, having acquired around 200 usable samples from various bodies, it was decided to input the remainder by using both typists and an OCR system. The task of checking such typed copy, and in particular of correcting the errors introduced by the OCR software, was carried out by the researcher, assisted by the on-going development of the Welsh spelling-checker, CySill. The additional costs of this work were borne by funding from the Welsh IT Unit at Bangor.

Where material was obtained directly from publishers or from individual authors, permission was sought for the data to be included in the project analysis, with the understanding that if they were ever to be made available to a wider audience, then a formal request would be made to the copyright holders for this use. Where samples were taken either by typing or by OCR from published works, formal permission for their use has not yet been requested, as samples of 2000 words were in most cases considered to fall under "fair dealing" for academic research purposes under the Copyright Acts. Any future public use of these materials will require the formal permission of their copyright holders.

It was decided to use the analytical software for Welsh that had been developed for a Welsh language spelling checker, then under development in the School of Psychology for Bwrdd yr Iaith Gymraeg / The Welsh Language Board. In its improved form this spelling checker included a set of lemmatization algorithms for handling the language in a computer environment, and it was felt that these programs could be adapted for lemmatizing the CEG text samples. The basic program for the spelling checker was modified to allow it to process and analyze the texts interactively. This required the ability to present the original text on screen for inspection by the researcher, and to offer interactive dialogue boxes to solve two fundamental problems with the software: words or word forms which did not appear in the spelling checker's own dictionary, and the possibility of homographs. The latter difficulty was solved by arranging for the software to identify a lemma by stripping off a particular ending and/or by demutating a word, then continuing to try possible endings and initial mutations in combination with other lemmas to check for possible homographs, effectively on the fly. Any such forms identified were presented on-screen to the researcher, with the original text still visible, to allow an informed choice to be made between the possibilities. In a similar way, the appearance of an unrecognized word or word form generated a dialogue box to allow the researcher to enter such words into a user dictionary, as well as allowing the forms to be incorporated into the tagged files which were produced from each separate text sample.
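The demutation step described above can be illustrated with a minimal Python sketch. This is not the CySill or CEG code, which is not reproduced here: the rule tables are the standard reversals of the Welsh initial mutations, the function names and the toy lexicon are hypothetical, and the ending-stripping step is left out. It only shows how trying every possible radical form against a dictionary surfaces homographs for a human to resolve.

```python
# Reverse mutation rules: mutated initial -> possible radical initials.
# These tables are an illustrative assumption, not the project's own rules.
SOFT = {"b": ["p"], "d": ["t"], "g": ["c"], "f": ["b", "m"],
        "dd": ["d"], "l": ["ll"], "r": ["rh"]}
NASAL = {"mh": ["p"], "nh": ["t"], "ngh": ["c"],
         "m": ["b"], "n": ["d"], "ng": ["g"]}
ASPIRATE = {"ph": ["p"], "th": ["t"], "ch": ["c"]}
VOWELS = "aeiouwy"


def demutation_candidates(word):
    """Yield (radical, mutation) guesses; the unmutated reading comes first."""
    yield word, None
    for name, rules in (("meddal", SOFT), ("trwynol", NASAL), ("llaes", ASPIRATE)):
        for mutated, radicals in rules.items():
            if word.startswith(mutated):
                for radical in radicals:
                    yield radical + word[len(mutated):], name
    # Soft mutation deletes an initial g, so any word may have lost one.
    yield "g" + word, "meddal"
    # h-provection adds h before an initial vowel.
    if len(word) > 1 and word[0] == "h" and word[1] in VOWELS:
        yield word[1:], "h-provection"


def analyses(word, lexicon):
    """Keep only the guesses whose radical form is a known lemma."""
    return [(radical, mutation)
            for radical, mutation in demutation_candidates(word.lower())
            if radical in lexicon]


# Hypothetical toy lexicon; a real run would use the full dictionary.
toy_lexicon = {"ceg", "rhodio", "llaw", "glaw"}
ambiguous = analyses("law", toy_lexicon)  # two readings: llaw and glaw
```

When `analyses` returns more than one reading, as for `law` here, the form is a candidate homograph of the kind the researcher was asked to resolve on screen.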

The main researcher worked on 350 of the 500 samples, and a part-time researcher was employed through the Welsh IT Unit to analyze the remaining 150. The average time for the analysis of each sample was around 1 hour, though the need to read over and correct typed or OCR-scanned text raised this to around 2 hours per sample.

File formats and Character coding conventions

All files are Windows text files, with <CR><LF> used as the line separator.

Accents are placed after the vowel they modify ( + = circumflex, % = dieresis, / = acute accent, \ = grave accent). For example, o+l stands for ôl.
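For readers who want conventional Unicode text, the accent convention above can be reversed with a short Python sketch. The function name is hypothetical, and the rule that a marker only counts when it directly follows a vowel is an assumption made here to avoid touching ordinary uses of +, %, / and \ elsewhere in a file; it has not been checked against the full corpus.

```python
import unicodedata

# CEG marker characters mapped to the matching Unicode combining marks.
ACCENT_MARKS = {
    "+": "\u0302",   # circumflex
    "%": "\u0308",   # dieresis
    "/": "\u0301",   # acute accent
    "\\": "\u0300",  # grave accent
}
# Welsh treats w and y as vowels, so both can carry accents.
VOWELS = set("aeiouwyAEIOUWY")


def decode_accents(text):
    """Turn vowel+marker pairs (e.g. 'o+l') into accented letters ('ôl')."""
    out = []
    for ch in text:
        if ch in ACCENT_MARKS and out and out[-1] in VOWELS:
            out.append(ACCENT_MARKS[ch])  # attach the mark to the vowel
        else:
            out.append(ch)
    # NFC composes vowel + combining mark into a single precomposed letter.
    return unicodedata.normalize("NFC", "".join(out))
```

Files should be opened with `newline=""` or read in universal-newline mode so that the <CR><LF> separators do not leave stray carriage returns in the decoded words.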

Description of the text files

Details of the 500 text samples are provided in the files below which list file number, text category, title, author and date.

The description data can be downloaded in the following formats:

The text category codes are as follows:

	Rh Ff
Gwasg - Gwyddonol	G Gw	Press - Scientific
Gwasg - Adroddiad	G A	Press - Report
Gwasg - Golygyddol	G G	Press - Editorial
Gwasg - Adolygiad	G Ad	Press - Review
Gwasg - Llythyrau	G Ll	Press - Letters
Plant - Ffeithiol	P Ff	Children - Factual
Ysgrythurol	Y	Scriptural
Bro a Bywyd Gwerin	B	Community Life
Gweinyddol - Adroddiad	Gw Ad	Administrative - Report
Gweinyddol - Llythyrau	Gw Ll	Administrative - Letters
Gweinyddol - Cofnodion/cytundebau	Gw C	Administrative - Minutes/contracts
Academaidd	A	Academic
Hunangofiant / Cofiant / Dyddiaduron / Atgofion	H	Autobiography / Biography / Diaries / Memories
Sgyrsiau/pigion	S	Discussions/ Highlights
Medrau a Diddordebau	M	Skills and Interests
Rhyddiaith Ddychmygol	Rh Dd	Fiction
Nofelau	N	Novels
Straeon Byrion	SB	Short Stories
Plant - Nofel	PN	Children's Novel
Plant - Straeon	PS	Children's Stories
Dyddiadur Dychmygol	D	Fictitious Diaries
Ysgrifau	YS	Articles/ Essays

The Raw and Tagged Datafiles

Most users will probably only want to access the processed results - the frequency counts of word forms or lemmas presented below. However, we also provide the original text samples as ASCII files along with the 500 tagged files for those who need to find words or constructions in their original context or for scholars who wish to correct or take forward the analyses presented here.

The 500 original text samples, each of approximately 2000 words:

Original ASCII files (zipped) (2.1Mb)

The 500 tagged files have the following format:

Lemma [tab] Raw word [tab] Part Of Speech [ [tab] Mutation - if present ] [tab] Line Number

Each line shows the lemmatized form, the original word, the part of speech, the type of mutation if present, and the location of the word (sample number, sentence number within sample, word number within sentence). For verbal forms, a number is attached to the lemma, as in bod:3, to show the particular morphographemic form appearing; these numbers are listed under Verb-form Codes.
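A line in this format could be read with a parser along the following lines. This is a sketch under assumptions: the `TaggedToken` type and function name are hypothetical, the bracketed `[sample.sentence.word]` location and the `lemma:code` convention are inferred from the illustration on this page, and real files may contain lines (punctuation, for instance) that need extra handling.

```python
import re
from typing import NamedTuple, Optional


class TaggedToken(NamedTuple):
    lemma: str
    verb_form: Optional[str]  # kept as text, since codes such as "4.1" occur
    raw: str
    pos: str
    mutation: Optional[str]
    sample: int
    sentence: int
    word: int


# Location field, assumed to look like [74.2.7] (brackets optional).
LOCATION = re.compile(r"\[?(\d+)\.(\d+)\.(\d+)\]?$")


def parse_tagged_line(line):
    """Split one tab-delimited line into its named parts."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 tab-separated fields: {line!r}")
    lemma_field, raw, pos = fields[0], fields[1], fields[2]
    # The mutation column is only present on mutated forms.
    mutation = fields[3] if len(fields) == 5 else None
    match = LOCATION.match(fields[-1].strip())
    if match is None:
        raise ValueError(f"unrecognised location field: {fields[-1]!r}")
    lemma, _, code = lemma_field.partition(":")
    sample, sentence, word = (int(part) for part in match.groups())
    return TaggedToken(lemma, code or None, raw, pos, mutation,
                       sample, sentence, word)


# Hypothetical input line, built to match the documented field order.
token = parse_tagged_line("ceg\tgeg\tnf\tmeddal\t[74.2.7]\r\n")
```

Raising on unexpected lines, instead of skipping them silently, makes it easier to find places where a file departs from the documented layout.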

Illustration of a sample sentence from a text follows:

a		part
bod:3		vbf
hynny		DemPron
'n		vbadj
golygu		vb
bod		vb	[74.2.7]
	y		[74.2.8]
	rhai		[74.2.9]
	dagreuol		[74.2.10]
	yn		[74.2.11]
	ein		[74.2.12]
	plith		[74.2.13]
	yn		[74.2.14]
	iachach		[74.2.15]
	na		[74.2.16]
	'r		[74.2.17]
	rhai		[74.2.18]
	sych		[74.2.19]
	?

We believe this text corpus is of value for an analysis of Welsh prose sentence patterns, for co-occurrence analyses of both individual lemmas and grammatical parts of speech in running texts, and for further linguistic analysis by specialist researchers in the field of Welsh syntax and child language acquisition. However, researchers must take note of some limitations in data quality, particularly regarding the accuracy of some of the lemma tags, which was compromised by word form homography - these limitations are described below.

All Tagged Files (zipped) (All fields are tab delimited) - 8 Mb
- 1*.tag (approx 2Mb each )
- 2*.tag
- 3*.tag
- 4*.tag
  
Data quality

We believe that the accuracy of the raw word forms in the database and their counts is quite high. Whatever errors (spelling or typographical) there were in the original samples will be carried over to the corpus. We must surely have introduced and failed to detect some additional errors in input, but we have tried hard to keep this number very low.

Tag quality is a different matter. High homography rates, a limited-window template-matching lemmatiser with few rules, and the need for skilled linguistic analysis combined to produce a non-trivial number of tagging errors. A preliminary analysis of 5% of the corpus indicates an error rate of 4% +/- 3%. These tagging errors are by no means distributed equally across the database. For example, inaccuracies in the tagging of yn, bod/fod, and a, and more generally of the high frequency closed class words, are much more common than inaccuracies with the open class words. Thus while the token error rate is perhaps 4%, the type error rate is much less than that.

We do not have the resources to correct these miscodings. As well as noting the errors on a print-out of the output files, it would be necessary for any corrections to be written back to the files, and we estimate that a detailed correction of the full set would require two years' work. Having tried to raise these resources, and waited too long, we have decided to release the database as it now stands - it is certainly better than nothing.

Nonetheless, researchers must take note of these limitations in data quality, particularly regarding the accuracy of some of the lemma tags.

We believe the Counts of raw word forms to be highly accurate.

The Lemma Counts with analysis of inflections and mutations runs at about 96% accuracy with most problems on the high frequency closed class words.

Processed Results:

Counts of Raw Word Forms

The word counts are based on the actual word forms occurring. These words include spellings which represent dialectal forms, informal spellings of Welsh forms (generally following the suggestions of Cymraeg Byw, though this is by no means a universally applied standard for informal writing), foreign words (particularly from English), as well as wrongly spelled Welsh words (that is, misprints in the original texts).

Total number of word form tokens in the corpus is 1,079,032.

The total number of separate word form types is 37,195.

The 50 most frequent raw word forms are:

55588	yn	.	3821	cael
45945	y	.	3754	yw
33327	i	.	3546	wrth
33231	a	.	3545	ni
32573	'r	.	3463	hyn
26927	o	.	3023	na
15888	ar	.	2870	o+l
14990	ei	.	2721	hynny
14845	'n	.	2646	fe
14523	yr	.	2613	er
11785	ac	.	2594	neu
9922	oedd	.	2585	nid
9338	bod	.	2542	at
9056	mae	.	2511	sy
7751	am	.	2417	'w
7093	wedi	.	2401	hi
6118	ond	.	2360	dim
5568	un	.	2278	mynd
5415	'i	.	2240	byddai
5294	eu	.	2160	gyda
4991	gan	.	2137	yng
4988	fel	.	2110	iawn
4578	mewn	.	2066	pob
4149	a+	.	2065	lle
4142	roedd	.	2027	pan

At the other end of the frequency range, there is a very long tail of single occurrence forms, with 44% of the total entries falling into this group. Together, the words occurring once, twice or three times make up 64% of the total number of separate words (37,195). As might be expected, a large number of these very low frequency words consist of foreign borrowings, mis-spellings, dialectal forms and other types of variant spellings, and numbers. In most cases, the analysis program does distinguish between several of these categories (mis-spellings, foreign words, informal spellings), but such entries would require further checking if 100% accuracy were essential.

16,316 words with a single occurrence :	44% of separate words
5,013 words with two occurrences :	13% of separate words
2,644 words showing three occurrences:	7% of separate words
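The tail figures above can be recomputed from a word count file with a short Python sketch. The function names are hypothetical, and the file layout (a count, whitespace, then the word form on each line) is an assumption based on the top-50 listing on this page, not a documented format.

```python
from collections import Counter


def read_counts(lines):
    """Parse 'count<whitespace>word' lines into a {word: count} mapping."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0].isdigit():
            counts[parts[1]] = int(parts[0])
    return counts


def frequency_spectrum(word_counts):
    """Map each frequency to the number of word types with that frequency."""
    return Counter(word_counts.values())


def percent_of_types(spectrum, frequency):
    """Share of all word types that occur exactly `frequency` times."""
    total_types = sum(spectrum.values())
    return 100 * spectrum[frequency] / total_types


# Cross-check of the reported figures: 16,316 of 37,195 types occur once.
single_share = round(100 * 16316 / 37195)  # 44
```

Applied to the full frequency file, `percent_of_types(spectrum, 1)` should come out close to the 44% quoted above if the assumed layout is right.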

Lemma Counts with analyses of inflections and mutations

The lemmatization software was used to demutate and uninflect word forms in order to trace them back to their lemma. Examples of the resulting lemma analysis are shown for illustration in the table below:

ceg	118	ceg	n	118	ceg	109	nf	ceg	22	nf
								cheg	21	nf	llaes
								geg	56	nf	meddal
								ngheg	10	nf	trwynol
					cegau	9	npl	cegau	9	npl
rhodio	16	rhodio	vb	16	rhodia	2	vbf	rhodia	1	vbf :3
								rodia	1	vbf :3	meddal
					rhodiai	1	vbf	rodiai	1	vbf :10	meddal
					rhodio	12	vb	rhodio	7	vb
								rodio	5	vb	meddal
					rhodiwn	1	vbf	rhodiwn	1	vbf :4.1

The lemma ceg appears 118 times, exclusively as a noun. 109 of these occurrences are as the feminine singular noun (ceg) and 9 as the plural noun (cegau). As the singular noun it appears 22 times in unmutated form, 21 times with aspirate mutation, 56 times with soft mutation, and 10 times with nasal mutation.

The lemma rhodio appears 16 times, always as a verb. Two of these occurrences are as the third person singular present (rhodia) (once in unmutated form and once with soft mutation), 1 occurrence is as the third person singular imperfect in soft mutated form (rodiai), 12 occurrences are as the verb noun rhodio (7 times unmutated and 5 times with soft mutation), and 1 is as the first person plural present tense (rhodiwn). There are many verb forms for Welsh - the full list of verb form codes is shown below.

Verb-form Codes

The table of verb form codes is shown below:

1	af	present tense first person singular
2	i	present tense second person singular
3	a	present tense third person singular
4	wn	present tense first person plural
5	wch	present tense second person plural
6	ant	present tense third person plural
7	ir	present tense impersonal
8	it	imperfect tense second person singular
9	et	imperfect tense second person singular
10	ai	imperfect tense third person singular
11	em	imperfect tense first person plural
12	ech	imperfect tense second person plural
13	ent	imperfect tense third person plural
14	id	imperfect tense impersonal
15	ais	past tense first person singular
16	aist	past tense second person singular
17	odd	past tense third person singular
18	asom	past tense first person plural
19	asoch	past tense second person plural
20	asant	past tense third person plural
21	wyd	past tense impersonal
22	aswn	pluperfect first person singular
23	asit	pluperfect second person singular
24	aset	pluperfect second person singular
25	asai	pluperfect third person singular
26	asem	pluperfect first person plural
27	asech	pluperfect second person plural
28	asent	pluperfect third person plural
29	asid	pluperfect impersonal
30	ed	impersonal imperative
31	wyf	subjunctive first person singular
32	ych	subjunctive second person singular
33	o	subjunctive third person singular
34	om	subjunctive first person plural
35	och	subjunctive second person plural
36	ont	subjunctive third person plural
37	er	subjunctive impersonal
38	es	past tense first person singular
39	est	past tense second person singular
40	ith	Informal third person singular
41	iff	Informal Future third person singular
42	on	Informal Past third person plural
43	an	Informal Future third person plural

The file, Lemma Counts with Analysis, downloadable below, is tab-separated and can be imported into Excel where it can be readily manipulated to provide a wide range of analyses. One example, based on a sort of the final field (mutation), generates the following results for initial mutations.

Initial mutations

Welsh words can exhibit one of four types of morphophonemic initial mutation, and the occurrences and relative frequencies of such forms in the sample are:

Soft mutation (Treiglad Meddal)	134,349	12.45%
Spirant mutation (Treiglad Llaes)	9,123	0.85%
Nasal mutation (Treiglad Trwynol)	5,667	0.53%
h-provection	1,990	0.19%
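The same tally can be produced without a spreadsheet. The Python sketch below sums token counts per mutation label from rows of the Lemma Counts with Analysis file. The column positions are assumptions read off the ceg illustration on this page (form in the ninth field, its count in the tenth, an optional mutation label in the twelfth), and the function name is hypothetical.

```python
from collections import Counter

# Zero-based column positions, inferred from the ceg example rows.
COUNT_COL, MUTATION_COL = 9, 11


def tally_mutations(lines):
    """Sum token counts per mutation label ('none' for unmutated forms)."""
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip rows that carry no form-level count.
        if len(fields) <= COUNT_COL or not fields[COUNT_COL].strip().isdigit():
            continue
        mutation = fields[MUTATION_COL].strip() if len(fields) > MUTATION_COL else ""
        totals[mutation or "none"] += int(fields[COUNT_COL])
    return totals


# The ceg rows from the illustration, with blank leading cells kept as tabs.
ceg_rows = [
    "ceg\t118\tceg\tn\t118\tceg\t109\tnf\tceg\t22\tnf",
    "\t\t\t\t\t\t\t\tcheg\t21\tnf\tllaes",
    "\t\t\t\t\t\t\t\tgeg\t56\tnf\tmeddal",
    "\t\t\t\t\t\t\t\tngheg\t10\tnf\ttrwynol",
    "\t\t\t\t\tcegau\t9\tnpl\tcegau\t9\tnpl",
]
ceg_totals = tally_mutations(ceg_rows)
```

For the ceg rows the per-label totals add up to the lemma's 118 occurrences, which is a convenient sanity check on the assumed column positions. Dividing a label's total for the whole file by the 1,079,032 tokens gives percentages of the kind listed above.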

Download Word Form files

Zip file containing: (890Kb)

Word Counts (freq) - Counts of raw word forms sorted in decreasing frequency
Word Counts (alpha) - Counts of raw word forms sorted in alphabetic order
Lemma Counts with Analysis - Counts of lemmas, plus inflected forms, parts of speech and mutations

Use of these Materials

These materials have been produced on a small budget for academic research. You are welcome to use the materials for any non-commercial purpose. We have produced these analyses in good faith to the best of our abilities given the limited resources. As we have described above, you should be aware that there are some inaccuracies in the taggings. We bear no responsibility for any damaging consequences that may result from these.

We welcome further research to extend or correct these linguistic descriptions.

Articles based on the use of the database should cite:

Ellis, N. C., O'Dochartaigh, C., Hicks, W., Morgan, M., & Laporte, N. (2001). Cronfa Electroneg o Gymraeg (CEG): A 1 million word lexical database and frequency count for Welsh. [On-line]

Available: www.bangor.ac.uk/canolfanbedwyr/ceg.php.en