Date: Sept 6th, 2011
Room: HSEB 3515B
Dissertation Title: TASK-DRIVEN DYNAMIC TEXT SUMMARIZATION
Supervisory Committee: Alun Thomas, chair, Nicola Camp, Hilary Coon, Lewis Frey, Scott Narus
Prior to her work in biomedical informatics, Liz worked for several years in public, corporate, and academic libraries. She received a Master's degree in Library and Information Studies from the University of Hawaii, and a B.A. in English (with a minor in Italian) from the University of Utah. Her primary research interests involve NLP applications that use multiple data types for purposes including clinical decision support and knowledge discovery. When she is not working on BMI projects, she likes to spend her time reading, hiking, and being with her daughter.
The objective of this work is to examine the efficacy of natural language processing (NLP) in summarizing bibliographic text for multiple purposes. Researchers have noted the accelerating growth of bibliographic databases. Information seekers using traditional information retrieval techniques when searching large bibliographic databases are often overwhelmed by excessive, irrelevant data.
Scientists have applied natural language processing technologies to improve retrieval. Text summarization, a natural language processing approach, simplifies bibliographic data while filtering it to address a user’s need. Traditional text summarization can necessitate the use of multiple software applications known as schemas to accommodate diverse processing refinements known as “points-of-view”.
A new, statistical approach to text summarization can transform this process. Combo, a statistical algorithm composed of three individual metrics, determines which elements within input data are relevant to a user’s specified information need, thus enabling a single software application to summarize text for many points-of-view. In this dissertation, I describe this algorithm and the research process used to develop and test it. The research process comprised three separate studies. The goal of the first study was to create a conventional schema accommodating a genetic disease etiology point-of-view, along with an evaluative reference standard; this was accomplished by simulating the task of secondary genetic database curation. The second study addressed the development and initial evaluation of the algorithm, comparing its performance to that of the conventional schema using the pre-established reference standard, again within the task of secondary genetic database curation. The third study evaluated the algorithm’s performance in accommodating two additional points-of-view, namely prevention and drug treatment, in a simulated clinical decision support task, which also employed a conventional schema that accommodated a treatment point-of-view.
Both summarization methods successfully identified data that were salient to the tasks they addressed. The conventional genetic disease etiology schema located salient information for database curation. The conventional treatment schema located treatment data relevant to clinical decision support. The Combo algorithm located salient genetic disease etiology, treatment, and prevention data for their respective tasks.
Successful dynamic text summarization could serve many purposes beyond the database curation and clinical decision support tasks examined here, and future users from many groups may benefit from this technology.