Result: Exploiting Near-Duplicate Relations in Organizing News Archives

Title:

Exploiting Near-Duplicate Relations in Organizing News Archives

Authors:

WANG, Jenq-Haur, CHANG, Hung-Chi

Source:

International journal of intelligent systems. 29(7):597-614

Publisher Information:

Hoboken, NJ: Wiley, 2014.

Publication Year:

2014

Physical Description:

print, 29 ref

Original Material:

INIST-CNRS

Subject Terms:

Computer science, Informatique, Sciences exactes et technologie, Exact sciences and technology, Sciences appliquées, Applied sciences, Informatique; automatique théorique; systèmes, Computer science; control theory; systems, Logiciel, Software, Systèmes informatiques et systèmes répartis. Interface utilisateur, Computer systems and distributed systems. User interface, Organisation des mémoires. Traitement des données, Memory organisation. Data processing, Systèmes d'information. Bases de données, Information systems. Data bases, Intelligence artificielle, Artificial intelligence, Reconnaissance et synthèse de la parole et du son. Linguistique, Speech and sound recognition and synthesis. Linguistics, Actualités, News, Noticias, Amas, Cluster, Montón, Analyse amas, Cluster analysis, Análisis cluster, Analyse contenu, Content analysis, Análisis contenido, Analyse documentaire, Document analysis, Análisis documental, Analyse statistique, Statistical analysis, Análisis estadístico, Archive, Archivo, Classification, Clasificación, Efficacité, Efficiency, Eficacia, Informatique documentaire, Documentation data processing, Informática documental, Internet, Navigation information, Information browsing, Navegación información, Phrase, Sentence, Frase, Précision élevée, High precision, Precisión elevada, Relation ordre, Ordering, Relación orden, Répétition, Repetition, Repetición, Réseau social, Social network, Red social, Réseau web, World wide web, Red WWW, Résultat expérimental, Experimental result, Resultado experimental, Similitude, Similarity, Similitud

Document Type:

Journal Article

File Description:

text

Language:

English

Author Affiliations:

Department of Computer Science and Information Engineering, National Taipei University of Technology, Taiwan, Province of China
Institute of Information Science, Academia Sinica, Taiwan, Province of China

ISSN:

0884-8173

Access URL:

http://pascal-francis.inist.fr/vibad/index.php?action=search&terms=28594282

Rights:

Copyright 2015 INIST-CNRS
CC BY 4.0
Unless otherwise stated above, the content of this bibliographic record may be used under a CC BY 4.0 licence by Inist-CNRS

Notes:

Computer science; theoretical automation; systems

Accession Number:

edscal.28594282

Database:

PASCAL Archive

Further Information

Huge numbers of documents are being generated on the Web, especially news articles and social media content. Organizing these evolving documents so that readers can easily browse or search them is a challenging task. Existing methods include classification, clustering, and chronological or geographical ordering, each of which provides only a partial view of the relations among news articles. To better utilize cross-document relations, in this paper we propose a novel approach to organizing news archives by exploiting their near-duplicate relations. First, we use a sentence-level, statistics-based approach to near-duplicate copy detection that is language independent and simple but effective. Near-duplicate detection is used because content-based approaches are usually time consuming and not robust to term substitutions. Second, by extracting the cross-document relations into a block-sharing graph, we derive a near-duplicate clustering in which users can easily browse and find unnecessary repetitions among documents. In our experiments, we observed high efficiency and good accuracy of the proposed approach in detecting and clustering near-duplicate documents in news archives.
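
The abstract does not give the details of the sentence-level statistics or of how the block-sharing graph is built, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that a "block" is a normalized sentence, that two documents are linked when they share at least `min_shared` sentences, and that clusters are the connected components of the resulting graph. All function names and the threshold are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def split_sentences(text):
    """Split text into a set of normalized sentences (assumed 'blocks')."""
    # Split on Latin and CJK sentence terminators; no language-specific tools.
    parts = re.split(r"[.!?。！？]+", text)
    return {" ".join(p.split()).lower() for p in parts if p.strip()}


def block_sharing_graph(docs, min_shared=2):
    """Return {(doc_a, doc_b): shared_sentence_count} for linked documents.

    docs maps a document id to its text. min_shared is an assumed threshold.
    """
    # Inverted index: sentence -> ids of the documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for sentence in split_sentences(text):
            index[sentence].add(doc_id)

    # Count shared sentences for every pair of documents.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


def cluster(docs, edges):
    """Group documents into connected components of the graph (union-find)."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for doc_id in docs:
        groups[find(doc_id)].append(doc_id)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    sample = {
        "a": "Storm hits coast. Thousands evacuated. Power is out.",
        "b": "Storm hits coast. Thousands evacuated. Officials respond.",
        "c": "Markets rally today. Tech shares lead.",
    }
    edges = block_sharing_graph(sample)
    print(edges)
    print(cluster(sample, edges))
```

Exact sentence matching keeps the sketch language independent and cheap, in the spirit of the abstract, but it would miss copies with small edits inside a sentence; a real system would likely need a more tolerant sentence-level statistic.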
