Case Studies

Newspaper Digitization System

Industry:
Document digitization
Client:
Confidential
Platform:
Web
Country:
European Union
Duration:
5 months
98%
Recognition accuracy
Searchable text
Digital archive solution
Newspaper Digitization System

Project Summary

A newspaper digitization app for a European document scanning agency. Detection of articles that span across multiple columns, text extraction, and article type detection

Need a digital archive?

We develop custom solutions for document digitization

Get A Quote

Services

Dataset analysis
OCR system development

Team

1 Project manager
1 Machine learning developer
1 Full-stack engineer

Target Audience

National archives
Museums
Historical societies

Challenge

Our client is a large document scanning and digitization company based in Europe that was looking for a way to improve and scale their newspaper scanning efforts.

Our client has approached us with the task of creating an application for intelligent newspaper processing and digitization. The task of digitizing a newspaper is the most complicated OCR-based one as newspapers have complex layouts with articles spanning across multiple columns, starting and stopping in arbitrary places, often broken up by images and advertisements.

Solution

We started with the analysis of newspaper pages to get a clear understanding of the page and article structure. During this analysis we have discovered that many historical newspapers have suffered some damage, like paper warping, scratches, or fading ink. This forced us to develop a preprocessing module for increasing the quality of scanned newspaper pages.

Newspaper Preprocessing

This module removes any paper warp caused by old age or moisture damage, removes dust and scratches, and fills in letters and symbols that have faded with time.

Certain small symbols, like punctuation and umlauts, tend to get lost during OCR especially if the source material is not in the best condition, therefore we have put additional effort into preservation of these symbols during preprocessing.

Newspaper Article Extraction

The next challenge was the detection of various articles and their types. Newspaper articles are difficult if not impossible to accurately detect for standart OCR solutions due to their complex structure and huge formatting variability.

Our application detects text blocks that belong to the same article across the entire page and compiles them together in a correct order, including headers, subheaders, illustrations, authorship attribution, and any other elements that are part of the same article. For each detected article our system determines its type, like an editorial article, an advertisement, an obituary, etc.

The text and images from each article are extracted in a form of editable and searchable text document.

Visual Editor

All relations between article blocks are presented visually and can be manually edited using a visual editor. The user can click on any item within a newspaper page and reassign its article affiliation, change the blocks order, and change article types.

Recognition Quality

We have managed to achieve 98-99% recognition quality, making our system a reliable solution for digitizing newspapers of any time period and publisher.

Results

Our newspaper digitization system is successfully used to create high-quality digital copies of historical and modern newspapers and has helped our client to grow their business and become one of the top document digitization companies in Europe.

Let's Work Together!

Do you want to know the total cost of development and realization of the project? Tell us about your requirements, our specialists will contact you as soon as possible.