Result: Using Automatic Article Detection and Marking Software in Production of Newspaper Clippings of a Digitized Finnish Historical Journalistic Collection

Title:
Using Automatic Article Detection and Marking Software in Production of Newspaper Clippings of a Digitized Finnish Historical Journalistic Collection
Contributors:
National Library of Finland, Equipe Apprentissage (LITIS - DocApp), Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes (LITIS), Université Le Havre Normandie (ULH), Normandie Université (NU)-Normandie Université (NU)-Université de Rouen Normandie (UNIROUEN), Normandie Université (NU)-Institut national des sciences appliquées Rouen Normandie (INSA Rouen Normandie), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA)-Université Le Havre Normandie (ULH), Institut National des Sciences Appliquées (INSA)-Normandie Université (NU)-Institut National des Sciences Appliquées (INSA), Equipe Apprentissage (LITIS - App), Laboratoire d'Informatique, du Traitement de l'Information et des Systèmes (LITIS), FEDER PlaIR, PIVAJ
Publisher Information:
CCSD, 2019.
Publication Year:
2019
Collection:
collection:INSA-ROUEN
collection:LITIS
collection:COMUE-NORMANDIE
collection:UNIROUEN
collection:UNILEHAVRE
collection:INSA-GROUPE
Original Identifier:
HAL: hal-04485545
Document Type:
Journal article
Language:
English
Rights:
info:eu-repo/semantics/OpenAccess
Accession Number:
edshal.hal.04485545v1
Database:
HAL

Further Information

It is common practice to digitize historical newspaper collections at page level: pages of the physical newspapers are scanned and OCRed, and the page images serve as the basic browsing and searching unit of the collection. Searches of the collection are made at page level, and results are shown to the user at page level. A page, however, is not a basic informational unit of a newspaper, only a typographical or printing unit. Pages consist of articles or news items (as well as advertisements and notices of different kinds), whose length and form can vary considerably. Thus, separating the article structure of digitized newspaper pages is an important step in improving the usability of digital newspaper collections. As the amount of digitized historical journalistic information grows, good search, browsing, and exploration tools for harvesting the information are also needed, as these affect the usability of the collection. The contents of a collection are a key element of its usefulness, but the presentation of the contents to the user is also important. The ability to use article structure will also improve further analysis stages of the content, such as topic modeling or other kinds of content analysis. Several digitized historical newspaper collections have implemented article extraction on their pages. Good examples are the Italian La Stampa, the British Newspaper Archive, and the Australian Trove.

The historical digital newspaper archive environment of the National Library of Finland is based on the commercial docWorks software. The software is capable of article detection and extraction, but our material does not seem to behave well in the system in this respect. We have not been able to produce good article segmentation with docWorks, although such work has been accomplished, e.g., in the Europeana Newspaper framework. However, we have recently produced article separation and marking on pages of one newspaper, Uusi Suometar, using article extraction software named PIVAJ, developed in the LITIS laboratory of the University of Rouen Normandy [1]. In this article we describe the intended use of the extracted articles in our digital library presentation system, digi.kansalliskirjasto.fi (Digi), as newspaper clippings which the user can collect from the markings of the article extraction software.