Seminar by Albert GATT, Senior Lecturer at the University of Malta, organised by the LIG laboratory

Thursday 28 January at 3 pm, UFR IMAG, building F



Natural Language Generation (NLG) is concerned with the automatic generation of text or speech from non-linguistic input data. Over the past two decades, NLG has been shown to be effective in a wide variety of settings, especially in domains where complex data or knowledge structures need to be communicated to humans in an easily understandable format. Examples of applications include weather report generation and the summarisation of medical data. The development of these applications has also led to the widespread use of evaluation methods that focus on task effectiveness, though these are not the only methods used in the field.
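To make the data-to-text idea concrete, here is a minimal, hypothetical sketch in Python of a template-based report generator in the weather domain. The field names, thresholds and wording are invented for illustration; real systems involve much richer content selection, microplanning and surface realisation stages.

def generate_weather_report(data):
    # Content selection: keep only the fields worth mentioning.
    # (Field names and thresholds are hypothetical.)
    parts = []
    if data.get("wind_speed_kt", 0) > 25:
        parts.append(f"strong winds of around {data['wind_speed_kt']} knots")
    if data.get("rain_mm", 0) > 0:
        parts.append(f"{data['rain_mm']} mm of rain expected")
    # Surface realisation: join the selected content into a sentence.
    if not parts:
        return "Conditions will remain calm."
    return "Forecast: " + " and ".join(parts) + "."

print(generate_weather_report({"wind_speed_kt": 30, "rain_mm": 5}))
# -> Forecast: strong winds of around 30 knots and 5 mm of rain expected.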


Although there is a fairly well-established evaluation tradition in NLG, relatively little work has been done to compare different types of evaluation method. Several open questions therefore remain, echoing related concerns in fields such as Machine Translation and Summarisation. Among the relevant questions are:
- Do intrinsic and extrinsic evaluation methods in NLG give compatible results?
- Can corpora be treated as reliable ‘gold standards’ against which to compare NLG outputs using automatic metrics?
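As an illustration of the second question, the sketch below computes one simple corpus-based metric of the kind used in GRE evaluation: the Dice coefficient between the attribute set selected by a system and the attribute set found in a human-produced corpus reference. The attribute-value pairs are invented for illustration.

def dice(system_attrs, reference_attrs):
    # Dice coefficient: 2|A & B| / (|A| + |B|), ranging over [0, 1].
    if not system_attrs and not reference_attrs:
        return 1.0
    return (2 * len(system_attrs & reference_attrs)
            / (len(system_attrs) + len(reference_attrs)))

system = {("type", "man"), ("height", "tall")}
reference = {("type", "man"), ("height", "tall"), ("accessory", "glasses")}
print(round(dice(system, reference), 2))  # -> 0.8

Whether a high score on such a metric predicts that the expression will actually help a reader identify the referent is precisely the kind of question the talk addresses.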
In this talk, I will first introduce the field of NLG and its sub-tasks. One of these sub-tasks is the Generation of Referring Expressions (GRE), which is concerned with constructing noun phrases that identify entities in a relevant domain of discourse (e.g. the tall man with glasses, the red chair in the top left). GRE has also been the focus of a series of Shared Task Evaluation Challenges organised over a period of three years. As a result, there is a substantial body of evaluation data for different systems, including automatically computed metrics against corpora, human judgements of quality, and task-based measures collected through controlled experiments.

I will discuss some results from follow-up studies of these data, which have revealed significant divergences among the different methods, suggesting that the relationship between corpus-based and task-based measures is problematic. The talk will also attempt to link these results to related findings in MT and Summarisation.
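As background to the GRE sub-task, the following is a compact, simplified sketch of the classic Incremental Algorithm (Dale & Reiter, 1995), one standard approach to selecting the content of a referring expression. The domain, attributes and preference order are invented for illustration, and the sketch omits refinements such as always including the head-noun type.

def incremental_algorithm(target, distractors, preference_order):
    # Walk through attributes in a fixed preference order, keeping each
    # attribute that rules out at least one remaining distractor, until
    # the target is uniquely identified.
    description = {}
    remaining = list(distractors)
    for attr in preference_order:
        value = target.get(attr)
        if value is None:
            continue
        ruled_out = [d for d in remaining if d.get(attr) != value]
        if ruled_out:  # attribute has discriminatory power: keep it
            description[attr] = value
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:  # target uniquely identified
            return description
    return description  # may be non-distinguishing if attributes run out

target = {"type": "man", "height": "tall", "accessory": "glasses"}
distractors = [{"type": "man", "height": "short"},
               {"type": "chair", "colour": "red"}]
print(incremental_algorithm(target, distractors,
                            ["type", "height", "accessory"]))
# -> {'type': 'man', 'height': 'tall'}, realisable as "the tall man"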