
The Bolshoi Theatre Museum has completed a major project to digitise a range of historical documents, with the aim of making the information publicly accessible and searchable via its website.

Four thousand volunteers helped scan 8,000 historic posters, 120,000 programmes and 100,000 rare photographs from the 192-year-old Russian theatre's museum archives, in order to convert them into digital formats.

“The three key objectives of the project are to rediscover, preserve and share the Bolshoi Theatre’s history, as part of the world’s cultural heritage,” Bolshoi Theatre Museum director Lidia Kharina told Computerworld UK.

“The project helps to uncover previously overlooked facts, patterns and insight. It also helps protect and preserve the artefacts by creating digital copies. And lastly, it is important to make the archive easily searchable and accessible to the public all over the world.”

Kharina said that it was previously impossible to find a full and consistent record of a specific performer’s roles at the Bolshoi, for example, or when they had their debut on stage. “Questions such as these will be easy to answer with the new digital archive,” she explained.
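To illustrate the kind of question Kharina describes, here is a minimal sketch in Python of how a performer's roles and debut date could be looked up once the records are structured. The table name `appearances`, its columns and the sample rows are all hypothetical placeholders; the article does not describe the archive's actual schema or data.

```python
import sqlite3


def build_demo_archive():
    """Create an in-memory database with a hypothetical 'appearances' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE appearances ("
        "performer TEXT, role TEXT, production TEXT, performed_on TEXT)"
    )
    # Invented placeholder rows, not real archive data.
    rows = [
        ("Performer A", "Role 2", "Production X", "1951-10-02"),
        ("Performer A", "Role 1", "Production X", "1950-03-12"),
        ("Performer A", "Role 1", "Production X", "1950-04-01"),
        ("Performer B", "Role 3", "Production Y", "1949-01-20"),
    ]
    conn.executemany("INSERT INTO appearances VALUES (?, ?, ?, ?)", rows)
    return conn


def roles_and_debut(conn, performer):
    """Return (sorted distinct roles, earliest performance date) for a performer."""
    roles = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT role FROM appearances "
            "WHERE performer = ? ORDER BY role",
            (performer,),
        )
    ]
    # ISO-formatted dates sort correctly as text, so MIN() gives the earliest.
    debut = conn.execute(
        "SELECT MIN(performed_on) FROM appearances WHERE performer = ?",
        (performer,),
    ).fetchone()[0]
    return roles, debut


if __name__ == "__main__":
    archive = build_demo_archive()
    print(roles_and_debut(archive, "Performer A"))
```

A query like this only works if every appearance has been captured in consistent fields, which is why the extraction and verification steps matter.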

As with all large-scale digitisation projects, the Bolshoi Theatre Museum faced a number of technical challenges in converting documents to machine-readable formats.

Kharina said that stage one of the project involved scanning posters and programmes held in the Bolshoi Museum using ABBYY FineReader. FineReader is a PDF tool with optical character recognition (OCR) capabilities which is able to recognise and digitise scanned information.

The software takes a scanned image and automatically converts the text it contains into machine-readable form.

The first challenge was to make sense of the old spellings and heritage fonts, as well as damaged or dark original documentation. “This was solved using the OCR technology which recognises text, even in sub-standard conditions,” Kharina said.

The second challenge was coordinating the many volunteers who were dispersed across multiple locations. “This was overcome by making the technology accessible remotely, by multiple users at the same time,” she said.

Finally, unstructured data such as scanned images often causes problems when it comes to classifying and analysing information. “This challenge was overcome using a proprietary combination of full-text semantic analysis and machine learning,” Kharina said.

The next stage of the project will be to use text analytics to categorise the unstructured data and put the information into the correct database fields of the digital archive. 

“ABBYY AI experts have written complex and comprehensive rules to teach the data extraction algorithm to take into account the varied structures of heritage documents and the similarity of certain data entities such as ‘role’ and ‘last name’, as well as the order and so on,” Kharina said.

She added that the machine learning algorithms then learned from this linguistic and structural information and put the information into the correct database fields. “Now the volunteers are starting the verification process to find and fix any mistakes that could have occurred, making sure that the names, titles and musical instruments used are all found correctly and put in the right field.”
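The idea of sorting recognised text into fields, then passing doubtful cases to human checkers, can be sketched in Python as follows. This is not ABBYY's method, which the article describes as proprietary and based on semantic analysis and machine learning. It is a much simpler stand-in using a single regular expression, and it assumes a hypothetical programme layout in which each cast line reads "role, separator, performer". The names `CAST_LINE` and `extract_cast` and the sample lines are illustrative only.

```python
import re

# Assumed layout: "<role> ..... <performer>", "<role>: <performer>"
# or "<role> - <performer>". Real heritage documents vary far more.
CAST_LINE = re.compile(
    r"^\s*(?P<role>.+?)(?:\s*\.{2,}\s*|\s*:\s*|\s+-\s+)(?P<performer>.+?)\s*$"
)


def extract_cast(lines):
    """Split OCR'd lines into role/performer fields, flagging unclear ones."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        match = CAST_LINE.match(line)
        if match:
            records.append(
                {
                    "role": match["role"],
                    "performer": match["performer"],
                    "needs_review": False,
                }
            )
        else:
            # No recognisable separator: keep the raw text for a volunteer.
            records.append(
                {
                    "role": None,
                    "performer": None,
                    "raw": line.strip(),
                    "needs_review": True,
                }
            )
    return records


if __name__ == "__main__":
    sample = ["Role 1 ..... Performer A", "Role 2 - Performer B", "Interval"]
    for record in extract_cast(sample):
        print(record)
```

Lines that do not fit the assumed pattern are kept as raw text and marked for review, which mirrors the verification step the volunteers are now carrying out.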

