Result: Using an N-Gram-based document representation with a vector processing retrieval model

Title:
Using an N-Gram-based document representation with a vector processing retrieval model
Authors:
Source:
TREC-3: text retrieval conference. NIST special publication. (500225):269-277
Publisher Information:
Gaithersburg, MD: National Institute of Standards and Technology, 1995.
Publication Year:
1995
Physical Description:
print, 8 ref
Original Material:
INIST-CNRS
Document Type:
Conference Paper
File Description:
text
Language:
English
ISSN:
1048-776X
Rights:
Copyright 1997 INIST-CNRS
CC BY 4.0
Unless otherwise stated above, the content of this bibliographic record may be used under a CC BY 4.0 licence by Inist-CNRS
Notes:
Sciences of information and communication. Documentation

FRANCIS
Accession Number:
edscal.2484568
Database:
PASCAL Archive

Further Information

N-gram-based document representations have several distinct advantages for document processing tasks. First, they are more robust to grammatical and typographical errors in the documents. Second, they require no linguistic preparation such as word stemming or stopword removal, which makes them well suited to multi-language operation. Vector processing retrieval models also have advantages of their own for information retrieval: they provide a simple, uniform representation for documents and queries, an intuitively appealing document similarity measure, and, in their modern forms, good retrieval performance. In this work, we combine the two ideas by using a vector processing model for documents and queries in which the vector element values are based on N-gram frequencies rather than the more traditional term frequencies. The resulting system provides good retrieval performance on the TREC-1 and TREC-2 tests without any word stemming or stopword removal. We have also begun testing the system on Spanish-language documents.
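
The general idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: this record does not give the N-gram length, the weighting scheme, or the similarity measure they used. The sketch assumes overlapping character 4-grams, raw counts as vector elements, and cosine similarity; the function names `ngram_counts`, `cosine_similarity`, and `rank` are hypothetical.

```python
from collections import Counter
import math


def ngram_counts(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams (n = 4 is an assumed value)."""
    # Only lowercase and collapse whitespace: no stemming, no stopword list.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank(query: str, documents: list[str], n: int = 4) -> list[tuple[float, int]]:
    """Return (score, document index) pairs, best match first."""
    q = ngram_counts(query, n)
    scores = [
        (cosine_similarity(q, ngram_counts(doc, n)), idx)
        for idx, doc in enumerate(documents)
    ]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    docs = [
        "Information retrieval with vector models",
        "Cooking pasta at home",
        "Informaton retreival using vectr modls",  # deliberate typos
    ]
    for score, idx in rank("information retrieval", docs):
        print(f"{score:.3f}  {docs[idx]}")
```

In this toy run the misspelled document still shares several 4-grams with the query, so it ranks above the unrelated one. That is the kind of robustness to typographical errors the abstract refers to, shown here only on made-up sentences.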