Articles

[RDF] Configuring URI Access to Redirect to the Snorql Interface

This is a continuation of a previous article. It is a memo on configuring redirects so that accessing a URL such as https://xxx.abc/data/123 redirects to https://xxx.abc/?describe=https://xxx.abc/data/123, using Japan Search’s RDF store as a reference.

Japan Search example:

    https://jpsearch.go.jp/entity/chname/葛飾北斎 -> https://jpsearch.go.jp/rdf/sparql/easy/?describe=https://jpsearch.go.jp/entity/chname/葛飾北斎

Create a conf file like the following and place it in the appropriate location (e.g., /etc/httpd/conf.d/).

    RewriteEngine on
    RewriteCond %{HTTP_ACCEPT} .*text/html
    RewriteRule ^/((data|entity)/.*) https://xxx.abc/?describe=https://xxx.abc/$1 [L,R=303]

Then restart Apache.

    systemctl restart httpd

This enables redirecting to the Snorql interface. ...
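To make the rule easier to reason about, here is a minimal Python sketch that mimics what the RewriteCond and RewriteRule above decide. It is only an illustration of the matching logic, not part of the Apache setup; the function name snorql_redirect is made up for this example, and BASE is the placeholder host used in the article.

```python
import re
from typing import Optional

# Placeholder host from the article; replace with your own domain.
BASE = "https://xxx.abc"

# Same pattern as the RewriteRule: paths that start with /data/ or /entity/.
RULE = re.compile(r"^/((data|entity)/.*)")


def snorql_redirect(path: str, accept: str) -> Optional[str]:
    """Return the 303 redirect target, or None when the rule does not apply."""
    # Mirrors: RewriteCond %{HTTP_ACCEPT} .*text/html
    if "text/html" not in accept:
        return None
    match = RULE.match(path)
    if match is None:
        return None
    # $1 in the RewriteRule corresponds to the first capture group.
    return f"{BASE}/?describe={BASE}/{match.group(1)}"


print(snorql_redirect("/data/123", "text/html,application/xhtml+xml"))
print(snorql_redirect("/data/123", "application/rdf+xml"))
```

Because of the Accept condition, only clients that ask for HTML (typically browsers) are redirected; clients that request RDF serializations are left alone.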

Returning JSON from Hugging Face Spaces

Previously, I built an inference app using Hugging Face Spaces and a YOLOv5 model (trained on the NDL-DocL dataset). This time, I modified that app to add JSON output, as shown in the following diff: https://huggingface.co/spaces/nakamura196/yolov5-ndl-layout/commit/4d48b95ce080edd28d68fba2b5b33cc17b9b9ecb#d2h-120906 This makes it possible to process the returned results programmatically, as shown in the following notebook: https://github.com/nakamura196/ndl_ocr/blob/main/GradioのAPIを用いた物体検出例.ipynb There may be better approaches, but I hope this serves as a useful reference.

Building a Virtuoso RDF Store Using AWS EC2

Introduction

These are notes on building a Virtuoso RDF store using AWS EC2. They cover custom domain configuration, HTTPS connection, and Snorql installation. There are many other useful articles on building Virtuoso. Please refer to them as well:

https://midoriit.com/2014/04/rdfストア環境構築virtuoso編1.html
https://qiita.com/mirkohm/items/30991fec120541888acd
https://zenn.dev/ningensei848/articles/virtuoso_on_gcp_faster_with_cos

Prerequisites

An ACM Certificate should already be created. Please refer to articles such as the following:

https://dev.classmethod.jp/articles/specification-elb-setting/#toc-3

EC2

First, create an EC2 instance. Select Amazon Linux, and set the instance type to t2.micro. ...

How to Register and Delete RDF Files in Virtuoso RDF Store Using curl and Python

Overview

Notes on how to register and delete RDF files in a Virtuoso RDF store using curl and Python. The following page was used as a reference.

https://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFInsert#HTTP PUT

curl

The procedure follows the page above. First, create myfoaf.rdf as sample data for registration.

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:foaf="http://xmlns.com/foaf/0.1/">
      <foaf:Person rdf:about="http://www.example.com/people/中村覚">
        <foaf:name>中村覚</foaf:name>
      </foaf:Person>
    </rdf:RDF>

Next, execute the following command.

    curl -T ${filename1} ${endpoint}/DAV/home/${user}/rdf_sink/${filename2} -u ${user}:${passwd}

A specific example is as follows.

    curl -T myfoaf.rdf http://localhost:8890/DAV/home/dba/rdf_sink/myfoaf.rdf -u dba:dba

Python

Here is an execution example. It uses rdflib to create the RDF file from scratch. By setting the action to delete, you can also perform deletion. ...
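As a rough illustration of the same idea, the sketch below builds the HTTP request that corresponds to the curl example using only the Python standard library. It is not the code from the article (which uses rdflib): the function name build_rdf_sink_request is made up, Basic authentication is assumed because curl -u uses it by default, and deletion is assumed to be an HTTP DELETE on the same URL. The request is only constructed and printed, not sent.

```python
import base64
import urllib.request
from urllib.parse import quote


def build_rdf_sink_request(endpoint, user, passwd, filename, data=None, action="create"):
    """Build (but do not send) a PUT or DELETE request for the rdf_sink folder."""
    url = f"{endpoint}/DAV/home/{quote(user)}/rdf_sink/{quote(filename)}"
    # Assumption: Basic authentication, matching `curl -u user:passwd`.
    token = base64.b64encode(f"{user}:{passwd}".encode("utf-8")).decode("ascii")
    headers = {"Authorization": f"Basic {token}"}
    if action == "delete":
        # Assumption: the file is removed with a plain HTTP DELETE.
        return urllib.request.Request(url, headers=headers, method="DELETE")
    headers["Content-Type"] = "application/rdf+xml"
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


# Sample RDF/XML payload, encoded as UTF-8 bytes.
rdf_xml = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:foaf="http://xmlns.com/foaf/0.1/">'
    '<foaf:Person rdf:about="http://www.example.com/people/1">'
    "<foaf:name>Example</foaf:name></foaf:Person></rdf:RDF>"
).encode("utf-8")

req = build_rdf_sink_request("http://localhost:8890", "dba", "dba", "myfoaf.rdf", data=rdf_xml)
print(req.get_method(), req.full_url)
```

To actually send the request, pass it to urllib.request.urlopen(req) against a running Virtuoso instance.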

Building an Inference App Using Hugging Face Spaces and a YOLOv5 Model (Trained on the NDL-DocL Dataset)

Overview

I created an inference app using Hugging Face Spaces and the YOLOv5 model (trained on the NDL-DocL dataset) introduced in the following article. You can try it at the following URL.

https://huggingface.co/spaces/nakamura196/yolov5-ndl-layout

You can also download the source code and trained model from the following URL. I hope this serves as a reference when developing similar applications.

https://huggingface.co/spaces/nakamura196/yolov5-ndl-layout

The application was developed with reference to the following Space.

https://huggingface.co/spaces/pytorch/YOLOv5

Usage

You can upload an image or select one from the Examples. The recognition results can be viewed as shown below. ...

Dumping Elasticsearch Data to Local

To dump data from Elasticsearch to a local machine, I used elasticsearch-dump. Here are my notes.

https://github.com/elasticsearch-dump/elasticsearch-dump

By using the -v option as shown below, files created in the container persist on the host side. The --limit option and others are optional.

    docker run -v [absolute path of host directory]:[absolute path in container] --rm -ti elasticdump/elasticsearch-dump \
      --input [source Elasticsearch index endpoint] \
      --output=[absolute path in container]/[output file name].json \
      --limit 10000

Specifically, it looks like the following. ...
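If you want to generate this command from a script, a small helper like the following may be handy. This is only a sketch: the function name elasticdump_command is made up, and the host directory, container directory, and index URL in the usage line are hypothetical values, not the ones from the article. The command is assembled and printed, not executed.

```python
import shlex


def elasticdump_command(host_dir, container_dir, input_url, output_name, limit=10000):
    """Assemble the docker command for elasticsearch-dump as an argument list."""
    return [
        "docker", "run",
        "-v", f"{host_dir}:{container_dir}",  # bind mount so the dump persists on the host
        "--rm", "-ti",
        "elasticdump/elasticsearch-dump",
        "--input", input_url,
        f"--output={container_dir}/{output_name}.json",
        "--limit", str(limit),
    ]


# Hypothetical values for illustration only.
cmd = elasticdump_command("/home/user/dump", "/tmp", "http://localhost:9200/my_index", "my_index")
print(shlex.join(cmd))
```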

Building a Layout Extraction Model Using the NDL-DocL Dataset and YOLOv5

Overview

I built a layout extraction model using the NDL-DocL dataset and YOLOv5.

https://github.com/ndl-lab/layout-dataset
https://github.com/ultralytics/yolov5

You can try this model using the following notebook.

https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/NDL_DocLデータセットとYOLOv5を用いたレイアウト抽出モデル.ipynb

This article is a record of the training process above.

Creating the Dataset

The NDL-DocL dataset in Pascal VOC format is converted to YOLO format. For this method, refer to the following article. In addition to the conversion from Pascal VOC format to COCO format, conversion from COCO format to YOLO format was added. ...
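For reference, the COCO-to-YOLO step comes down to a change of box representation: COCO stores [x, y, width, height] in pixels with (x, y) as the top-left corner, while a YOLO label line stores the class index followed by the box center and size, normalized by the image dimensions. The sketch below shows that arithmetic only; the function names are made up and this is not the conversion code used in the notebook.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO box [x, y, w, h] in pixels to YOLO (cx, cy, w, h) in 0..1."""
    x, y, w, h = bbox
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


def yolo_label_line(class_id, bbox, img_w, img_h):
    """Format one line of a YOLO label file: class cx cy w h."""
    cx, cy, w, h = coco_to_yolo(bbox, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# A 50x100 box at (100, 200) in a 1000x2000 image.
print(yolo_label_line(0, [100, 200, 50, 100], 1000, 2000))
# 0 0.125000 0.125000 0.050000 0.050000
```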

Getting a Google Drive Folder ID from a Path Using Google Colab

This is based on the following page.

https://stackoverflow.com/questions/67324695/is-there-a-way-to-get-the-id-of-a-google-drive-folder-from-the-path-using-colab

By writing the following code, you can get a Google Drive folder ID from a path.

    # Mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive')

    # Install kora
    !pip install kora
    from kora.xattr import get_id

    # Example: get the ID of My Drive
    path = "/content/drive/MyDrive"
    fid = get_id(path)
    print("https://drive.google.com/drive/u/1/folders/{}".format(fid))

You can also try it from the following notebook. I hope you find this helpful.

https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/パスからGoogle_DriveのフォルダのIDを取得.ipynb

Conversion and Visualization of the NDL-DocL Dataset (Document Image Layout Dataset)

I created a notebook that converts Pascal VOC format XML files to COCO format JSON files and visualizes the contents of the NDL-DocL Dataset (Document Image Layout Dataset) published by NDL Lab. https://github.com/nakamura196/ndl_ocr/blob/main/NDL_DocLデータセット(資料画像レイアウトデータセット)の変換と可視化.ipynb By opening the above notebook and pressing “Runtime” > “Run all cells,” you can perform the conversion and visualization. By using the “/content/img” folder and “/content/dataset_kotenseki.json” file created after execution, you can use the data in machine learning programs that require COCO format data. ...
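To give a feel for what such a conversion involves, here is a minimal sketch that turns one Pascal VOC style annotation into a COCO style dictionary. It assumes the standard Pascal VOC element names (filename, size, object, bndbox); the actual dataset files and the notebook may be structured differently, and the sample XML and the function name voc_to_coco are made up for this example.

```python
import xml.etree.ElementTree as ET

# Made-up sample using standard Pascal VOC element names.
VOC_XML = """<annotation>
  <filename>sample.jpg</filename>
  <size><width>1000</width><height>2000</height></size>
  <object>
    <name>text</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>150</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>"""


def voc_to_coco(xml_text, image_id=1):
    """Convert a single VOC annotation document into a COCO-style dict."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    image = {
        "id": image_id,
        "file_name": root.findtext("filename"),
        "width": int(size.findtext("width")),
        "height": int(size.findtext("height")),
    }
    categories, annotations = {}, []
    for obj in root.iter("object"):
        # Assign category ids in order of first appearance, starting at 1.
        cat_id = categories.setdefault(obj.findtext("name"), len(categories) + 1)
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (int(box.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        w, h = xmax - xmin, ymax - ymin
        # VOC uses corner coordinates; COCO uses [x, y, width, height].
        annotations.append({
            "id": len(annotations) + 1, "image_id": image_id, "category_id": cat_id,
            "bbox": [xmin, ymin, w, h], "area": w * h, "iscrowd": 0,
        })
    return {
        "images": [image],
        "annotations": annotations,
        "categories": [{"id": i, "name": n} for n, i in categories.items()],
    }


print(voc_to_coco(VOC_XML)["annotations"][0]["bbox"])
```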

Hosting Hugging Face Models on AWS Lambda for Serverless Inference

Overview

This is a personal note on hosting Hugging Face models on AWS Lambda for serverless inference, based on the following article.

https://aws.amazon.com/jp/blogs/compute/hosting-hugging-face-models-on-aws-lambda/

Additionally, I cover providing an API using Lambda function URLs and CloudFront.

Hosting Hugging Face Models on AWS Lambda

Preparation

For this section, I referred to the document introduced at the beginning.

https://aws.amazon.com/jp/blogs/compute/hosting-hugging-face-models-on-aws-lambda/

First, run the following commands. I created a virtual environment called venv, but this is not strictly required. ...

I Created a Program to Extract Differences Between Two Texts

Overview

I created a program to extract differences between two texts. You can use it from the following Google Colab notebook.

https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/校異情報の生成.ipynb

A well-known service for this purpose is “difff”, but this time I implemented it using Python.

https://difff.jp/

For calculating the differences between texts, I used difflib.SequenceMatcher.

https://docs.python.org/ja/3/library/difflib.html

Usage

You can choose between two output formats: HTML files and TEI files.

HTML

Here is an example of the HTML file output. ...
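As a small illustration of how difflib.SequenceMatcher can be used for this, the sketch below lists only the spans that differ between two strings. It is a simplified example, not the program from the notebook, and the function name extract_differences is made up.

```python
from difflib import SequenceMatcher


def extract_differences(a, b):
    """Return (tag, text_in_a, text_in_b) for every span that is not identical."""
    matcher = SequenceMatcher(None, a, b)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replace / insert / delete
    ]


print(extract_differences("abcd", "abXd"))  # [('replace', 'c', 'X')]
print(extract_differences("abc", "abXc"))   # [('insert', '', 'X')]
```

Each tuple could then be rendered as highlighted HTML or as TEI apparatus markup, depending on the output format you need.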

Trying Omeka Classic as a Headless CMS

Overview

Omeka S and Omeka Classic are very useful tools for building digital archives and for humanities (informatics) research.

https://omeka.org/

They come with a REST API as standard and are highly extensible through modules and plugins. Various existing assets can also be used, including IIIF-related tools, transcription support tools, and tools for handling spatiotemporal information. On the other hand, I (personally) feel that theme development for changing the appearance of sites requires knowledge of PHP and Omeka, making it relatively difficult. On this point, the Headless CMS approach, where the backend and frontend are separated, has been gaining popularity in recent years. ...

Created an Image Comparison Tool Using Mirador 3

I created an image comparison tool using Mirador 3. The URL is as follows. https://ldas-jp.github.io/viewer/input/ The GitHub repository URL is as follows. https://github.com/ldas-jp/viewer Below is the input form. You specify the URLs of the IIIF manifest files and the Canvas URIs for the images you want to compare. You can check input examples by clicking the buttons under “Examples.” Clicking the “Open” button launches Mirador 3 as shown below. You can compare images based on the input information. ...

Bulk Registration of Annotations Using the IIIF Toolkit for Omeka Classic

Introduction

This article is primarily a memorandum. There may be many unclear points, so please bear with me. In particular, I hope this serves as a useful reference for how to use the annotation endpoint used by the IIIF Toolkit, as introduced below.

https://github.com/utlib/IiifItems/wiki/The-Mirador-Omeka-Annotator-Endpoint

Overview

The IIIF Toolkit plugin for Omeka Classic is a very useful tool that can load IIIF manifest files and add annotations to images.

https://zenn.dev/nakamura196/books/2a0aa162dcd0eb/viewer/b37a8c

This article covers how to bulk register annotations that were created independently of Omeka Classic into Omeka Classic. ...

Building an Omeka Classic Site Using Amazon Lightsail (Including Custom Domain + SSL)

Overview

I summarized how to build Omeka S using Amazon Lightsail in the following article. This time, I will introduce how to build Omeka Classic using Amazon Lightsail. As described in the following book, Omeka Classic is useful for building annotation environments using the IIIF Toolkit.

https://zenn.dev/nakamura196/books/2a0aa162dcd0eb

Amazon Lightsail

Creating an Instance

Access the following page.

https://lightsail.aws.amazon.com/ls/webapp/home/instances

Then click the “Create Instance” button. Under “Select a blueprint,” choose “LAMP (PHP 7).” ...

NDL OCR Now Supports Ruby (Furigana) Text Extraction

Overview

Previously, the default setting of NDL OCR did not include ruby (furigana) text extraction. Thanks to the cooperation of the NDL team, it is now possible to configure whether or not to perform text extraction for ruby.

https://github.com/ndl-lab/ndlocr_cli/

The following setting in config.yaml is shown with the value False; changing it to True enables the ruby text extraction feature.

    yield_block_rubi: False

Please note the following caveats when using this feature:

- Ruby text is not always split at the exact kanji positions where furigana is placed; multiple ruby sections may be merged into a single output
- Because ruby characters are small, they may sometimes be output as a placeholder character

Tutorial Notebook Updates

The ruby text extraction option has also been added to the Google Colab tutorial. ...

Aggregations with Different Keys and Values (Labels and IDs) in Elasticsearch

Overview

I am currently working on updating the search application for the Cultural Japan project, and I needed to perform aggregation on multilingual data. This article is a memo of what I found when investigating how to do this.

Data

For the data, we assume a case where the agential (indicating a person) field has values for id, ja, and en.

    {
      "agential": [
        {
          "ja": "葛飾北斎",
          "en": "Katsushika, Hokusai",
          "id": "chname:葛飾北斎"
        }
      ]
    }

For the above data, we want to perform filtering by id while displaying the ja or en value according to the language setting. ...
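One common way to get both an id and a display label out of a single aggregation is to bucket by id and attach a sub-aggregation that picks up the label. The sketch below builds such a request body and flattens a response. It is only one possible approach and not necessarily the one the article settles on. It assumes agential is mapped as a nested field whose id, ja, and en subfields are keyword fields; the function names and the sample response are made up for illustration.

```python
import json


def agential_aggregation(lang="ja", size=10):
    """Request body: bucket by agential.id, with the label in the chosen language."""
    return {
        "size": 0,
        "aggs": {
            "agential": {
                # Assumption: `agential` uses the nested mapping type.
                "nested": {"path": "agential"},
                "aggs": {
                    "by_id": {
                        "terms": {"field": "agential.id", "size": size},
                        "aggs": {
                            "label": {"terms": {"field": f"agential.{lang}", "size": 1}}
                        },
                    }
                },
            }
        },
    }


def flatten_buckets(response):
    """Turn the aggregation response into (id, label, count) tuples."""
    rows = []
    for bucket in response["aggregations"]["agential"]["by_id"]["buckets"]:
        labels = bucket["label"]["buckets"]
        # Fall back to the id when no label exists in the chosen language.
        label = labels[0]["key"] if labels else bucket["key"]
        rows.append((bucket["key"], label, bucket["doc_count"]))
    return rows


# Made-up response fragment in the shape Elasticsearch returns for this request.
sample = {"aggregations": {"agential": {"doc_count": 3, "by_id": {"buckets": [
    {"key": "chname:葛飾北斎", "doc_count": 3,
     "label": {"buckets": [{"key": "Katsushika, Hokusai", "doc_count": 3}]}}
]}}}}

print(json.dumps(agential_aggregation("en"), ensure_ascii=False, indent=2))
print(flatten_buckets(sample))
```

With this shape, the application can show the label in facets while sending the id back as the filter value.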

Bug and Fix for Omeka S Bulk Import

The Bulk Import module for batch registration of items and media in Omeka S has a bug in versions 3.3.28.0 through 3.3.33.2 that prevents media from being registered. If you need to register media, you will need a workaround such as using version 3.3.27.0 or earlier. After creating an issue about this problem, the bug was promptly fixed: https://gitlab.com/Daniel-KM/Omeka-S-module-BulkImport/-/issues/10 As of July 1, only the source code on GitLab has been updated, but it should be added to the GitHub Releases soon. Please be aware of this when using this module. ...

Created a Video on How to Use the NDLOCR App with Google Colab

I created a video on how to use the NDLOCR app with Google Colab. I hope it serves as a useful reference. https://youtu.be/46p7ZZSul0o The blog used in the video is the following. Note that the “Initial Setup” portion has been shortened in the video; in practice, it takes about 3-5 minutes.

Scheduled Backup of Omeka S Data Using AWS Copilot

Overview

I previously created a program to download Omeka S data. This time, I use AWS Copilot to run the above program on a scheduled basis.

Installing AWS Copilot

Please refer to the following.

https://docs.aws.amazon.com/ja_jp/AmazonECS/latest/developerguide/AWS_Copilot.html

Preparing Files

Create three files in any location: Dockerfile, main.sh, and .env.

Dockerfile

    FROM python:3
    COPY *.sh .
    CMD sh main.sh

main.sh

    set -e
    export output_dir=../docs

    # Program to download data from Omeka S
    export repo_tool=https://github.com/nakamura196/omekas_backup.git
    dir_tool=tool
    dir_dataset=dataset

    # If folder exists
    if [ -d $dir_tool ]; then
      rm -rf $dir_tool
      rm -rf $dir_dataset
    fi

    # clone
    git clone --depth 1 $repo_tool $dir_tool
    git clone --depth 1 $repo_dataset $dir_dataset

    # requirements.txt
    cd $dir_tool
    pip install --upgrade pip
    pip install -r requirements.txt

    # Execute
    cd src
    sh main.sh

    # copy
    odir=../../$dir_dataset/$subdir
    mkdir -p $odir
    cd $odir
    cp -r ../../$dir_tool/data .
    cp -r ../../$dir_tool/docs .

    # git
    git status
    git add .
    git config user.email "$email"
    git config user.name "$name"
    git commit -m "update"
    git push

    # Cleanup
    cd ../../
    rm -rf $dir_tool
    rm -rf $dir_dataset

.env

    api_url=https://dev.omeka.org/omeka-s-sandbox/api
    github_url=https://<personal-access-token>@github.com/<username>/<repository-name>.git
    username=nakamura
    email=nakamura@example.org
    dirname=dev

The following is an explanation of the parameters. ...