Treffer: Discretizing continuous attributes in AdaBoost for text categorization

Title:
Discretizing continuous attributes in AdaBoost for text categorization
Source:
Advances in information retrieval (Pisa, 14-16 April 2003)Lecture notes in computer science. 2633:320-334
Publisher Information:
Berlin: Springer, 2003.
Publication Year:
2003
Physical Description:
print, 18 ref
Original Material:
INIST-CNRS
Document Type:
Konferenz Conference Paper
File Description:
text
Language:
English
Author Affiliations:
MercurioWeb SNC, Via Appia, 85054 Muro Lucano (PZ), Italy
Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, 56124 Pisa, Italy
Dipartimento di Matematica Pura ed Applicata, Università di Padova, 35131 Padova, Italy
ISSN:
0302-9743
Rights:
Copyright 2003 INIST-CNRS
CC BY 4.0
Sauf mention contraire ci-dessus, le contenu de cette notice bibliographique peut être utilisé dans le cadre d’une licence CC BY 4.0 Inist-CNRS / Unless otherwise stated above, the content of this bibliographic record may be used under a CC BY 4.0 licence by Inist-CNRS / A menos que se haya señalado antes, el contenido de este registro bibliográfico puede ser utilizado al amparo de una licencia CC BY 4.0 Inist-CNRS
Notes:
Sciences of information and communication. Documentation

FRANCIS
Accession Number:
edscal.15264423
Database:
PASCAL Archive

Weitere Informationen

We focus on two recently proposed algorithms in the family of boosting-based learners for automated text classification, AD-ABoosT.MH and ADABOOST.MHKR. While the former is a realization of the well-known ADABOOST algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the weighted representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of ADABOOST-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.