Tech | Digital Archive Systems Tech Blog

Complete Guide to Annotation Coordinate Conversion in Leaflet-IIIF

Overview This article explains how to accurately display annotation coordinates (in xywh format) from IIIF (International Image Interoperability Framework) Presentation API v3 manifests on a map viewer using Leaflet-IIIF. While this problem may seem simple at first glance, accurate coordinate conversion is not possible without understanding the inner workings of Leaflet-IIIF. Background Annotation Format in IIIF Manifests In IIIF Presentation API v3, the target region of an annotation is specified in xywh format as follows: ...

Complete Guide to Migrating an Omeka-S Docker Environment to Another Server

Introduction This article explains the procedure for migrating an Omeka-S environment set up with Docker Compose, including volume data, to a different server. You can proceed with the migration safely while maintaining data integrity. Environment Source server: Ubuntu 22.04 Target server: Ubuntu 22.04 (fresh setup) Stack: Omeka-S + MariaDB + phpMyAdmin + Traefik + Mailpit Migration Flow Backup on the source server Download to local machine Set up Docker environment on the target server Restore data and start up Step 1: Backup on the Source Server 1.1 Check Current Environment # d # d o o C c C c h k h k e e e e c r c r k k p r s D o u o l n c u n k m i e e n r g l s c o o l n u t m a e i s n e r s Example output: ...

Distinguishing Between RDFS and SHACL: Understanding the Relationship Between range and propertyShape

Overview When working with data in RDF (Resource Description Framework), two mechanisms come into play: “RDFS (RDF Schema)” and “SHACL (Shapes Constraint Language)”. Both can define constraints on properties and classes, but their purposes and behaviors are completely different. This article answers the following commonly confused questions: What is the difference between rdfs:domain / rdfs:range and SHACL’s sh:class / sh:datatype? Is it acceptable to set SHACL constraints that differ from the RDFS range? Is it problematic to specify a datatype (xsd:string) in SHACL when the range is a class (foaf:Person)? 1. The Fundamental Difference Between RDFS and SHACL RDFS: For Inference RDFS is a declaration that says “if this property is used, the following knowledge can be derived.” ...

Developing an RDF Metadata Management System Integrating GakuNin RDM and Dydra

Overview This article describes the development of a metadata management system for research data that integrates GakuNin RDM (Research Data Management) with the Dydra RDF database. This system can handle file management for research projects and the registration and search of Dublin Core metadata in a unified manner. System Overview Architecture ┌ │ │ └ ┌ │ │ │ └ ─ ─ ─ G ─ ─ ─ ─ a ─ ─ ─ ─ k R A ─ ─ N ( ─ ┌ │ ▼ u D P ─ ─ e A ─ ─ ─ N M I ─ ─ x p ─ ─ ─ i ─ ─ t p ─ ─ ─ n ─ ─ . ─ ─ ┐ │ │ │ ┘ ─ j R ┬ │ ┴ ─ s o ─ ─ ┌ │ │ │ └ ─ u ─ ─ ─ ─ ─ 1 t ─ ─ ─ D ─ ─ 4 e ─ ─ ▼ y R ─ ─ r ─ ┐ │ ─ d D D ─ ─ ) ─ ─ r F B ─ ─ ─ ─ a ─ ─ ─ ─ ─ ┐ │ │ ┘ ─ ─ ┐ │ │ │ ┘ Technology stack: ...

Investigating the Vocabulary Hierarchy of Odeuropa Explorer

Overview Odeuropa Explorer is a fascinating project that digitizes European olfactory heritage. Funded by the EU’s Horizon 2020 research program, it provides a platform for cross-searching and exploring historical scent experiences. The project classifies scent-related information into three main categories: Smell sources: Objects and substances that emit odors Fragrant Spaces: Places and spaces associated with scents Gestures and Allegories: Gestures and allegorical expressions related to scents This article reports the results of investigating what hierarchical structures these vocabularies have, using SKOS (Simple Knowledge Organization System) data published in the Odeuropa vocabularies repository. ...

Guide to Registering RDF Data to Dydra via API

Background Dydra is a cloud-based RDF database service that provides a SPARQL endpoint and REST API. This article explains how to programmatically register RDF data using the Dydra API. Prerequisites A Dydra account An API key A Node.js environment (v16 or later recommended, when using Node.js) Note: The code examples in this article use the following sample values: Account name: your-account Repository name: your-repository API key: your_api_key_here When using them in practice, replace these with your own Dydra account information. ...

Declarative Multi-Format Conversion with TEI Processing Model

Introduction TEI (Text Encoding Initiative) is a widely used standard for digitizing humanities texts. This article introduces a case study of using the Processing Model feature introduced in TEI P5 to achieve conversion from TEI XML to multiple formats (HTML, LaTeX/PDF, EPUB3). https://www.tei-c.org/Vault/P5/3.0.0/doc/tei-p5-doc/en/html/TD.html#TDPM The target project uses texts published in the “Koui Genji Monogatari” (Collated Tale of Genji) as an example. https://kouigenjimonogatari.github.io/ Background Previously, conversion processes were performed individually, as introduced in the following articles. ...

How to Control the Viewing Direction of Mirador from External Parameters

Overview This article explains the implementation for dynamically specifying the viewingDirection of the Mirador viewer via URL parameters. This feature allows the same manifest to be displayed left-to-right or right-to-left. Implementation 1. Retrieving URL Parameters Retrieve the viewingDirection parameter from the URL and set a default value: c c o o n n G s s e t t t u v v r i i l e e P w w a i i r n n a g g m D s i d r i = e r c e n t c e i t w o i n o U n R = L f S u r e r o a l m r P c a U h r R P a L a m r s p a . a m g r s e a ( t m w ( e i ' t n v e d i r o e s w w . i l n o g c D a i t r i e o c n t . i s o e n a ' r ) c h ; ' r i g h t - t o - l e f t ' ; In this implementation, when no parameter is specified, 'right-to-left' is used as the default. ...

Odeuropa: The World of Linked Data for Extracting Scents from Historical Documents

Overview Odeuropa is a unique project that extracts descriptions of “scents” from European historical documents and structures them as Linked Data. This article explores the actual data through the SPARQL endpoint, revealing its structure and design philosophy. What is Odeuropa? Project name: Odeuropa (Odeurs d’Europe = Scents of Europe) Database URL: https://data.odeuropa.eu/ SPARQL endpoint: https://data.odeuropa.eu/repositories/odeuropa Web interface: https://explorer.odeuropa.eu/ Data Model Overview Odeuropa uses an extended ontology specialized for scents, built on top of CIDOC-CRM (Conceptual Reference Model for Cultural Heritage). ...

Achieving Japanese Full-Text Search with the MroongaSearch Module for Omeka-S

Overview Omeka-S is a powerful digital archive system, but Japanese full-text search barely works by default. This article explains how to achieve Japanese full-text search by installing the MroongaSearch module. Background: Why the MroongaSearch Module is Needed Problems with Omeka-S Standard Search Omeka-S’s standard full-text search (FullTextSearch module) uses the InnoDB engine, which has the following critical issues: Example of Japanese word search: D S R a e e t a s a r u : c l h t " : S ( t t 東 e N u 京 r o d 大 m y 学 : h i で i n 人 " t g 工 a s 知 r a 能 t r を i t 研 f i 究 i f す c i る i c ) a i l a l i n i t n e t l e l l i l g i e g n e c n e c " e ( a 人 t 工知 t 能 h ) e U n i v e r s i t y o f T o k y o " Since InnoDB’s full-text search assumes space-delimited languages like English, the following problems occur with Japanese: ...

Azure OpenAI GPT-4 vs Document Intelligence: Comparative Evaluation of Japanese Vertical Text OCR

Overview We performed OCR processing on Japanese vertical-writing manuscript paper using two OCR services provided by Microsoft Azure (Azure OpenAI GPT-4 Vision and Azure Document Intelligence), and conducted a detailed comparative evaluation of the results. Test Image Image Source: Canva template (400-character manuscript paper) URL: https://www.canva.com/ja_jp/templates/EAFbqUoH7P8/ Image Characteristics: 20x20 grid, 400-character manuscript paper Vertical writing layout Light grid lines (cells) Distinction between title and body sections Ground Truth 原佐原こ稿藤稿ののち用テタあ紙キイきにスト書トルくをテ使キ用スすトるが場入合りはま、す日。本作語文のや全小角論を文使をう作こっとたでりマ、ス小に説あをっ書たい文た字りをな打どつにこごと活が用でくきだまさすい。。手書きで使用したい場合は、このテキストを削除し、印刷してご使用ください。 1. Recognition Results by Azure OpenAI GPT-4.1 Recognized Text 原佐原こ稿藤稿のの用テタち紙キイあにストき書トルくをテ使キ用スすトるが場入合りはま、す日。本作語文のや全小角論を文使をう作こっとたでりマ、ス小に説あをっ書たい文た字りをな打どつにこごと活が用でくきだまさすい。。手書きで使用したい場合は、このテキストを削除し、印刷してご使用ください。 Evaluation GPT-4.1 demonstrated the following characteristics with vertical-writing manuscript paper: ...

LLM-Based Manuscript Paper OCR Performance Comparison: Verification of Vertical Japanese Recognition Accuracy

Introduction In this article, we compared and verified the OCR performance of major LLM models using actual manuscript paper images. While many OCR benchmarks target printed documents and horizontally written text, we evaluate recognition accuracy on the special format of Japanese vertical manuscript paper to more practically verify each model’s Japanese document understanding capabilities. Features of This Verification Using the uniquely Japanese manuscript paper format: Verification with images containing complex elements such as characters placed in grid cells, vertical writing layout, and distinctive margin composition Assuming practical use cases: Performance evaluation on manuscript paper used in actual writing scenarios such as essays, novels, and academic papers Comprehensive comparison of the latest models: Comparison of the latest models – GPT-5, GPT-4.1, Gemini 2.5 Pro, Claude Opus 4.1, and Claude Sonnet 4 – under identical conditions Verification Overview Image Used Image source: Canva template (400-character manuscript paper) URL: https://www.canva.com/ja_jp/templates/EAFbqUoH7P8/ Image characteristics: 20x20 grid, 400-character manuscript paper Vertical writing layout Faint grid lines (cells) Distinction between title area and body area ...

Challenges and Solutions for Preserving Order in PDF Transparent Text Extraction

Background When extracting the transparent text layer from PDF files, we encountered the problem of “text order differing from the original PDF.” This article explains the cause of this issue and solutions in both JavaScript and Python. There may be some inaccuracies, but we hope this serves as a useful reference. What is PDF Transparent Text? The transparent text layer of a PDF is searchable text information embedded within the PDF file. OCR-processed PDFs and digitally generated PDFs contain this transparent text layer, enabling the following features: ...

Guide to Publishing TEI/XML Files on GitHub

Introduction This article explains the procedure for uploading TEI (Text Encoding Initiative) format XML files to GitHub and creating URLs that anyone can access. TEI/XML is an international standard format for structurally describing texts such as historical documents and literary works. By using GitHub, you can share your research data with researchers around the world. What You Need A computer (Windows, Mac, or Linux) Internet connection TEI/XML files (that you already have) Email address (for creating a GitHub account) About Sample Files If you don’t have TEI/XML files, you can use the following TEI/XML file from the Koui Genji Monogatari for practice: ...

TEI ODD File Customization: A Case Study with NDL Classical Book OCR

Overview TEI (Text Encoding Initiative) is an international standard for digitizing and sharing texts in humanities research. This article introduces the process of customizing a TEI ODD file to match the output format of the NDL Classical Book OCR-Lite application. ODD (One Document Does it all) is a mechanism for customizing TEI schemas, allowing you to define your own schema containing only the elements and attributes you need. Background: Developing the NDL Classical Book OCR-Lite Application We are developing an application that outputs the results of NDL Classical Book OCR-Lite in TEI/XML format. The application is designed to perform OCR processing on Japanese classical books and output the results in standard TEI format. ...

Converting ODD to RNG/HTML Using the TEI Garage API

Introduction Generating schemas (RNG) and documentation (HTML) from TEI (Text Encoding Initiative) ODD (One Document Does it all) files is an important process in TEI projects. This article analyzes how the TEI Garage API, used internally by Roma (the TEI ODD editor), works and introduces how to call the API directly from scripts to convert ODD files. What Is TEI Garage? TEI Garage is a web service provided by the TEI community that can perform conversions between various formats. For ODD file processing in particular, it provides the following features: ...

A Scalable OCR Processing System Using NDL Classical Japanese OCR Lite on Azure Container Apps

Important Usage Notice The system described in this article may place load on external servers. Please exercise caution when using it. Server load: Parallel requests place load on target servers DoS risk: A large number of simultaneous accesses may be mistaken for a DoS attack Recommended approach: Download images locally in advance and run only the OCR processing in parallel Check terms of service: Always review the target server’s terms of service and obtain prior permission if necessary Appropriate rate limiting: In production, conservative concurrency settings (around 5-10 parallel) are strongly recommended Responsible use: Always be considerate of server administrators and other users This article is a record of a technical proof of concept. We ask readers to use the system responsibly. ...

How to Register the PROV-O Ontology in Omeka S

Introduction When building digital archives with Omeka S, using standard vocabularies for metadata description improves data interoperability. This article explains the steps to register PROV-O (PROV Ontology), developed by W3C, in Omeka S. PROV-O is an ontology for describing provenance information about data and digital objects, allowing structured recording of “who,” “when,” and “how” data was created or modified. Prerequisites Omeka S (version 3.0 or later) installed Logged in with administrator privileges Internet connection environment (required for importing from external URLs) Registration Steps 1. Accessing the Vocabulary Management Screen Log in to the Omeka S admin panel Click “Vocabularies” from the left menu Click the “Import new vocabulary” button in the upper right 2. Entering Basic Information Enter the vocabulary basic information as follows: ...

Image Collection Management Tool: Technical Architecture Explained

Overview In a previous article, we introduced an “Image Collection Management” tool designed for easily trying out IIIF features. https://zenn.dev/nakamura196/articles/7d6bb4cdc414c4 This article introduces the technologies used behind the scenes of this tool. Background The Image Collection Management Tool is a web application for managing and publishing image collections in the IIIF (International Image Interoperability Framework) format, an international standard. This article explains the technical implementation of the tool, focusing particularly on the IIIF specification implementation and the handling of geospatial information. ...

IIIF Georeference Viewer Migration to MapLibre GL and Feature Improvements

This article was created by AI with human additions. Overview We migrated the map component of the IIIF Georeference Viewer from Leaflet to MapLibre GL and implemented multiple feature improvements. This article explains the major features implemented and their technical details. https://nakamura196.github.io/iiif_geo/ Major Improvements 1. Automatic Image Rotation To display IIIF images in the correct orientation on the map, we implemented a feature that automatically calculates the rotation angle from control points (corresponding points). ...