Newspaper Digitization System

Client
Confidential
Platform
Web
Duration
5 months
Industry
Document digitization
Country
European Union
Newspaper Digitization System
98%
Recognition accuracy
10x
Faster than manual archival processing

A newspaper digitization app for a European document scanning agency. Detection of articles that span across multiple columns, text extraction, and article type detection

Need a digital archive?

We develop custom solutions for document digitization

Get A Quote

Services
Dataset analysis
OCR system development
Team
1 Project manager
1 Machine learning developer
1 Full-stack engineer
Target Audience
National archives
Museums
Historical societies

Challenge

Our client is a large document scanning and digitization company based in Europe that was looking for a way to improve and scale their newspaper scanning efforts.

Our client has approached us with the task of creating an application for intelligent newspaper processing and digitization. The task of digitizing a newspaper is the most complicated OCR-based one as newspapers have complex layouts with articles spanning across multiple columns, starting and stopping in arbitrary places, often broken up by images and advertisements.

Solution

We started with the analysis of newspaper pages to get a clear understanding of the page and article structure. During this analysis we have discovered that many historical newspapers have suffered some damage, like paper warping, scratches, or fading ink. This forced us to develop a preprocessing module for increasing the quality of scanned newspaper pages.

Newspaper Preprocessing

This module removes any paper warp caused by old age or moisture damage, removes dust and scratches, and fills in letters and symbols that have faded with time.

Certain small symbols, like punctuation and umlauts, tend to get lost during OCR especially if the source material is not in the best condition, therefore we have put additional effort into preservation of these symbols during preprocessing.

Newspaper Article Extraction

The next challenge was the detection of various articles and their types. Newspaper articles are difficult if not impossible to accurately detect for standart OCR solutions due to their complex structure and huge formatting variability.

Our application detects text blocks that belong to the same article across the entire page and compiles them together in a correct order, including headers, subheaders, illustrations, authorship attribution, and any other elements that are part of the same article. For each detected article our system determines its type, like an editorial article, an advertisement, an obituary, etc.

The text and images from each article are extracted in a form of editable and searchable text document.

Visual Editor

All relations between article blocks are presented visually and can be manually edited using a visual editor. The user can click on any item within a newspaper page and reassign its article affiliation, change the blocks order, and change article types.

Recognition Quality

We have managed to achieve 98-99% recognition quality, making our system a reliable solution for digitizing newspapers of any time period and publisher.

Results

Our newspaper digitization system is successfully used to create high-quality digital copies of historical and modern newspapers and has helped our client to grow their business and become one of the top document digitization companies in Europe.

Success Stories

How Much Does It Cost To Develop AI Healthcare Software

How Much Does It Cost To Develop AI Healthcare Software

December 2022
Bird Recognition App

Bird Recognition App

December 2022
How to Create Artificial Intelligence Software

How to Create Artificial Intelligence Software

June 2022
AI Document Automation Accuracy Improvement Techniques

AI Document Automation Accuracy Improvement Techniques

January 2024
AI-powered Digital Media Recommendation System

AI-powered Digital Media Recommendation System

April 2021
How To Choose The Right Software Development Team

How To Choose The Right Software Development Team

January 2023
AI-powered Aspect Extraction System For Amazon Reviews

AI-powered Aspect Extraction System For Amazon Reviews

February 2022

Contact Us

Let's Work Together!

Do you want to know the total cost of development and realization of the project? Tell us about your requirements, our specialists will contact you as soon as possible.

Please fill in the 'Name'
Please fill in the 'Phone'
Please fill in the 'Email'
Please fill in the 'Message'
BWT Chatbot