
Profiling specialized web corpus qualities: A progress report on "Domainhood"

Abstract

In this article we describe ways to profile the domain specificity, also known as domainhood, of specialized web corpora in English and Swedish. Several studies have measured the "qualities" of general-purpose web corpora. By contrast, less attention has been paid to the evaluation of specialized or domain-specific web corpora. To fill this gap, we present case studies in which we explore the effectiveness of several statistical measures for assessing domainhood: rank correlation coefficients (Kendall and Spearman), Kullback–Leibler divergence, log-likelihood and burstiness. Our findings indicate that it is possible to profile the domainhood quality of a corpus. However, further research is needed to generalize the results.
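
As a rough illustration of two of the measures named above, the following minimal sketch compares word frequencies in a domain corpus against a reference corpus. It is not the authors' implementation: the function names, the add-one smoothing, and the toy word lists are assumptions made for this example, and the log-likelihood shown is the common two-corpus G2 keyness formulation.

```python
import math
from collections import Counter


def kl_divergence(domain_counts, reference_counts):
    """KL(domain || reference) in bits.

    Add-one smoothing over the joint vocabulary is an assumption of this
    sketch; it keeps every probability non-zero.
    """
    vocab = set(domain_counts) | set(reference_counts)
    n_dom = sum(domain_counts.values()) + len(vocab)
    n_ref = sum(reference_counts.values()) + len(vocab)
    total = 0.0
    for word in vocab:
        p = (domain_counts.get(word, 0) + 1) / n_dom
        q = (reference_counts.get(word, 0) + 1) / n_ref
        total += p * math.log2(p / q)
    return total


def log_likelihood(freq_dom, freq_ref, size_dom, size_ref):
    """Two-corpus log-likelihood (G2) score for a single word.

    freq_dom / freq_ref: observed frequency of the word in each corpus.
    size_dom / size_ref: total token count of each corpus.
    """
    combined = freq_dom + freq_ref
    expected_dom = size_dom * combined / (size_dom + size_ref)
    expected_ref = size_ref * combined / (size_dom + size_ref)
    score = 0.0
    if freq_dom > 0:
        score += freq_dom * math.log(freq_dom / expected_dom)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


# Toy data, invented purely for illustration.
domain = Counter("insulin dose insulin glucose the of the".split())
reference = Counter("the of the and the house of".split())

print(kl_divergence(domain, reference))
print(log_likelihood(domain["insulin"], reference["insulin"],
                     sum(domain.values()), sum(reference.values())))
```

Under these assumptions, a larger divergence from the reference corpus, or a higher log-likelihood score for a candidate term, would point toward stronger domain specificity.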

Keywords

corpus evaluation, term extraction, log-likelihood, rank correlation, Kullback-Leibler divergence
