Title:
Generating an Intelligent Human-Readable Summary of a Shooting Event from a Large Collection of Webpages (alternative title: Generating a Summary for a Shooting Event)
Publication Year:
2014
Collection:
VTechWorks (VirginiaTech)
Document Type:
conference object; report; software
File Description:
application/pdf; application/vnd.openxmlformats-officedocument.wordprocessingml.document; application/vnd.openxmlformats-officedocument.presentationml.presentation; application/octet-stream
Language:
English
Rights:
Creative Commons CC0 1.0 Universal Public Domain Dedication ; http://creativecommons.org/publicdomain/zero/1.0/
Accession Number:
edsbas.5555BD36
Database:
BASE

More Information

Filename and description of all included files:

ShootingsReportPdf.pdf - PDF version of our project report.
ShootingsReportDoc.docx - MS Word version of our project report.
ShootingsPresentationPpt.pptx - MS PowerPoint version of our project presentation.
ShootingsPresentationPdf.pdf - PDF version of our project presentation.

Source code is included in the folder "shooting_summary_code":

shooting_summary_code/mapper.py - The map script used on the Hadoop cluster. This file contains the regular expressions used to extract the data from the collections. (An illustrative sketch of such a mapper follows below.)
shooting_summary_code/parse.py - Standard frequency-based reducer run on the Hadoop cluster. Takes sorted input from the mapper and reduces based on frequency.
shooting_summary_code/reducer.py - Python script that parses the output from the mapper and reducer; it does not run on the Hadoop cluster. It relies on the data being sorted by frequency (most frequent at the top of the file) and contains the implementation of the regular grammar, along with filtering techniques.
shooting_summary_code/TrigramTagger.pkl - A Python-pickled version of our trigram tagger, which is used in parse.py.

Abstract: We describe our approach to generating summaries of a shooting event from a large collection of webpages. We work with two separate events: a shooting at a school in Newtown, Connecticut, and another at a mall in Tucson, Arizona. Our corpora of webpages are inherently noisy and contain a large amount of irrelevant information. In our approach, we attempt to clean up our webpage collection by removing all irrelevant content. For this, we utilize natural language processing techniques such as word frequency analysis, part-of-speech tagging, and named entity recognition to identify key words about our news events. Using these key words as features, we employ classification techniques to categorize each document as relevant or irrelevant, and we discard the documents classified as irrelevant. We observe that to generate a summary, we require some specific information that enables us to answer important questions such as "Who ...
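
For orientation, here is a minimal sketch of a Hadoop Streaming mapper of the kind mapper.py implements: it scans raw webpage text with regular expressions and emits tab-separated key/count pairs for downstream frequency counting. The two patterns are hypothetical stand-ins, not the project's actual expressions.

    #!/usr/bin/env python
    # Hadoop Streaming mapper sketch: read raw webpage text on stdin and
    # emit "<match>\t1" pairs. The patterns are illustrative placeholders,
    # not the regular expressions used in the actual mapper.py.
    import re
    import sys

    PATTERNS = [
        re.compile(r"\b(?:shot|shooting|gunman|victims?)\b", re.IGNORECASE),
        re.compile(r"\b[A-Z][a-z]+,\s+(?:Connecticut|Arizona)\b"),  # "City, State" mentions
    ]

    for line in sys.stdin:
        for pattern in PATTERNS:
            for match in pattern.findall(line):
                print("%s\t1" % match.lower())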
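
parse.py is described above as a standard frequency-based reducer. A generic Streaming reducer of that kind, assuming input already sorted by key (which the Hadoop shuffle guarantees), might look like this sketch:

    #!/usr/bin/env python
    # Hadoop Streaming reducer sketch: sum the counts for each key,
    # relying on the shuffle phase having sorted the mapper output by key.
    import sys

    current_key, current_count = None, 0
    for line in sys.stdin:
        key, _, count = line.rstrip("\n").partition("\t")
        if key == current_key:
            current_count += int(count)
        else:
            if current_key is not None:
                print("%s\t%d" % (current_key, current_count))
            current_key, current_count = key, int(count)
    if current_key is not None:
        print("%s\t%d" % (current_key, current_count))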
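
The abstract mentions part-of-speech tagging and named entity recognition, and the file list includes a pickled trigram tagger used by parse.py. The sketch below shows one plausible way to load such a tagger and extract named entities with NLTK; the exact usage in the project is an assumption, and the standard NLTK data packages (the punkt tokenizer and the NE chunker models) are required.

    # Sketch: load a pickled trigram tagger and combine POS tagging with
    # NLTK's named entity chunker to surface candidate key words.
    # How the project wires these together is assumed, not documented here.
    import pickle

    import nltk

    with open("shooting_summary_code/TrigramTagger.pkl", "rb") as f:
        trigram_tagger = pickle.load(f)  # an NLTK TrigramTagger trained elsewhere

    sentence = "A gunman opened fire at a school in Newtown, Connecticut."
    tokens = nltk.word_tokenize(sentence)
    tagged = trigram_tagger.tag(tokens)         # part-of-speech tags from the pickled tagger
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))  # named entities via NLTK's default chunker

    entities = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() in ("PERSON", "GPE", "ORGANIZATION")]
    print(tagged)
    print(entities)  # e.g. ['Newtown', 'Connecticut']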
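
Finally, the abstract says the key words serve as features for classifying documents as relevant or irrelevant, without naming the classifier. As one concrete possibility, the sketch below uses the key words as a fixed scikit-learn vocabulary feeding a Naive Bayes model; the key-word list, the toy corpus, and the classifier choice are all assumptions for illustration.

    # Illustrative relevance classifier: key-word counts as features, Naive
    # Bayes as the model. The key-word list, corpus, and classifier choice
    # are all hypothetical; the report does not specify them.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    KEY_WORDS = ["shooting", "gunman", "victims", "newtown", "tucson"]

    train_docs = [
        "Police identified the gunman after the Newtown school shooting.",  # relevant
        "The Tucson mall shooting left several victims hospitalized.",      # relevant
        "Victims of the shooting were mourned across Newtown.",             # relevant
        "Top ten recipes for an easy weeknight dinner.",                    # irrelevant
        "Stock markets rallied on upbeat earnings reports.",                # irrelevant
    ]
    labels = ["relevant", "relevant", "relevant", "irrelevant", "irrelevant"]

    # Restricting the vocabulary makes the key words the only features.
    vectorizer = CountVectorizer(vocabulary=KEY_WORDS)
    clf = MultinomialNB().fit(vectorizer.transform(train_docs), labels)

    new_doc = ["Newtown victims remembered one year after the shooting."]
    print(clf.predict(vectorizer.transform(new_doc)))  # ['relevant'] on this toy data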