
Profiling specialized web corpus qualities: A progress report on "Domainhood"

Abstract

In this article we describe ways to profile the domain specificity, also known as domainhood, of specialized web corpora in English and in Swedish. Several studies have measured the "qualities" of general-purpose web corpora. By contrast, less attention has been paid to evaluating specialized or domain-specific web corpora. To fill this gap, we present case studies that explore how effectively several statistical measures, namely rank correlation coefficients (Kendall and Spearman), Kullback–Leibler divergence, log-likelihood and burstiness, assess domainhood. Our findings indicate that it is possible to profile the domainhood quality of a corpus. However, further research is needed to generalize the results.
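
As a rough illustration of two of the measures named above, the following Python sketch compares word frequencies in a specialized corpus against a reference corpus. It is not the authors' implementation: the function names `kl_divergence` and `log_likelihood`, the add-alpha smoothing, and the two toy word lists are assumptions made for this example, and the log-likelihood formula is the common two-corpus keyness formulation, which may differ from the one used in the article.

```python
import math
from collections import Counter


def kl_divergence(spec_counts, ref_counts, alpha=1.0):
    """KL(P_spec || P_ref) in bits over the joint vocabulary.

    Add-alpha smoothing (an assumption of this sketch) keeps every
    probability non-zero so the logarithm is always defined.
    """
    vocab = set(spec_counts) | set(ref_counts)
    n_spec = sum(spec_counts.values()) + alpha * len(vocab)
    n_ref = sum(ref_counts.values()) + alpha * len(vocab)
    total = 0.0
    for word in vocab:
        p = (spec_counts.get(word, 0) + alpha) / n_spec
        q = (ref_counts.get(word, 0) + alpha) / n_ref
        total += p * math.log2(p / q)
    return total


def log_likelihood(a, b, n_spec, n_ref):
    """Log-likelihood keyness score for a single word.

    a, b: occurrences of the word in the specialized and reference corpus.
    n_spec, n_ref: total token counts of the two corpora.
    """
    expected_spec = n_spec * (a + b) / (n_spec + n_ref)
    expected_ref = n_ref * (a + b) / (n_spec + n_ref)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_spec)
    if b > 0:
        score += b * math.log(b / expected_ref)
    return 2 * score


# Toy data, invented for illustration only.
spec = Counter("the patient received insulin the insulin dose".split())
ref = Counter("the cat sat on the mat the end".split())

divergence = kl_divergence(spec, ref)
insulin_score = log_likelihood(
    spec["insulin"], ref["insulin"], sum(spec.values()), sum(ref.values())
)
```

Under these assumptions, a larger divergence suggests that the specialized corpus departs more strongly from the reference distribution, and words with high log-likelihood scores are candidates for domain-specific terms.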

Keywords

corpus evaluation, term extraction, log-likelihood, rank correlation, Kullback-Leibler divergence
