Tech | Digital Archive Systems Tech Blog

Triggering GitHub Actions from Drupal Events

Overview This is a memorandum on how to trigger GitHub Actions from Drupal events. The following site was helpful: https://qiita.com/hmaruyama/items/3d47efde4720d357a39e Pipedream Configuration Create a workflow that includes a trigger and a custom_request. For the trigger, please refer to the following: https://qiita.com/hmaruyama/items/3d47efde4720d357a39e#pipedream側の設定 In custom_request, configure the dispatch settings. https://docs.github.com/ja/rest/repos/repos?apiVersion=2022-11-28#create-a-repository-dispatch-event Configure the settings as follows: curl -L \ -X POST \ -H "Accept: application/vnd.github+json" \ -H "Authorization: Bearer <YOUR-TOKEN>" \ -H "X-GitHub-Api-Version: 2022-11-28" \ https://api.github.com/repos/OWNER/REPO/dispatches \ -d '{"event_type":"webhook"}' ...

Inference App Using a YOLOv5 Model (Character Region Detection)

Overview The character region detection app is published at the following link. https://huggingface.co/spaces/nakamura196/yolov5-char The above app had stopped working, so I fixed it following the same procedure as in the following article. The model used in this app was built using the “Japanese Classical Character Dataset” (held by NIJL and others / processed by CODH) doi:10.20676/00000340. I also made some minor improvements during this fix, which I will introduce here. ...

Launching Jupyter Lab on mdx

Overview I had an opportunity to launch Jupyter Lab on mdx, so here are my notes. Please also refer to the following for mdx setup. References The following video was very helpful. https://youtu.be/-KJwtctadOI?si=xaKajk79b1MxTpJ6 Setup On the Server Install pip sudo apt install python3-pip Add to the PATH nano ~/.bashrc export PATH="$HOME/.local/bin:$PATH" source ~/.bashrc The following command launches Jupyter Lab. jupyter-lab Local Machine Connect via SSH with the following command. ssh -N -L 8888:localhost:8888 mdxuser@xxx.yyy.zzz.lll -i ~/.ssh/mdx/id_rsa Then, access the address displayed in the server console. ...

Fixing an Inference App Using Hugging Face Spaces and a YOLOv5 Model (Trained on NDL-DocL Dataset)

Overview In the following article, I introduced an inference app using Hugging Face Spaces and a YOLOv5 model trained on the NDL-DocL dataset. This app had stopped working, so I fixed it to make it operational again. https://huggingface.co/spaces/nakamura196/yolov5-ndl-layout Here are my notes on the changes made during this fix. Changes The modified app.py is shown below. import gradio as gr from PIL import Image import yolov5 import json model = yolov5.load("nakamura196/yolov5-ndl-layout") def yolo(im): results = model(im) # inference df = results.pandas().xyxy[0].to_json(orient="records") res = json.loads(df) im_with_boxes = results.render()[0] # results.render() returns a list of images # Convert the numpy array back to an image output_image = Image.fromarray(im_with_boxes) return [ output_image, res ] inputs = gr.Image(type='pil', label="Original Image") outputs = [ gr.Image(type="pil", label="Output Image"), gr.JSON() ] title = "YOLOv5 NDL-DocL Datasets" description = "YOLOv5 NDL-DocL Datasets Gradio demo for object detection. Upload an image or click an example image to use." article = "YOLOv5 NDL-DocL Datasets is an object detection model trained on the <a href=\"https://github.com/ndl-lab/layout-dataset\">NDL-DocL Datasets</a>." examples = [ ['『源氏物語』(東京大学総合図書館所蔵).jpg'], ['『源氏物語』(京都大学所蔵).jpg'], ['『平家物語』(国文学研究資料館提供).jpg'] ] demo = gr.Interface(yolo, inputs, outputs, title=title, description=description, article=article, examples=examples) demo.launch(share=False) First, due to Gradio version upgrades, I changed gr.inputs.Image to gr.Image and similar updates. ...

Handling ultralyticsplus: ValueError: Invalid CUDA 'device=0' requested...

Overview I have published an inference app using YOLOv8 at the following link: https://huggingface.co/spaces/nakamura196/yolov8-ndl-layout Initially, the following error occurred: ValueError: Invalid CUDA 'device=0' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU. torch.cuda.is_available(): False torch.cuda.device_count(): 0 os.environ['CUDA_VISIBLE_DEVICES']: None See https://pytorch.org/get-started/locally/ for up-to-date torch install instructions if no CUDA devices are seen by torch. This error was resolved by adding device as follows: ...

Prototyping entity-lookup Using the Japan Search Utilization Schema

Overview This is a continuation of the following article. I will prototype a package that performs CWRC entity-lookup using the Japan Search utilization schema. Demo You can try it on the following page. https://nakamura196.github.io/nuxt3-demo/entity-lookup/ Entity-lookup is performed against JPS, Wikidata, and VIAF for each type such as Person, Place, and Organization. Library It is published at the following location. https://github.com/nakamura196/jps-entity-lookup Based on the repository https://github.com/cwrc/wikidata-entity-lookup already published by CWRC, I mainly modified the following file to match the Japan Search utilization schema. ...

Trying cwrc's wikidata-entity-lookup

Overview This is a continuation of the following article. One of the features of LEAF-WRITER is described as follows: the ability to look up and select identifiers for named entity tags (persons, organizations, places, or titles) from the following Linked Open Data authorities: DBPedia, Geonames, Getty, LGPN, VIAF, and Wikidata. This feature uses libraries such as the following. https://github.com/cwrc/wikidata-entity-lookup I tried out this feature. Usage npm packages are published at the following locations. ...

Trying the CWRC XML Validator API

Overview One of the editors for TEI/XML is LEAF-WRITER. https://leaf-writer.leaf-vre.org/ It is described as follows: The XML & RDF online editor of the Linked Editing Academic Framework The GitLab repository is below. https://gitlab.com/calincs/cwrc/leaf-writer/leaf-writer One of the features of this tool is described as: continuous XML validation This validation appears to use the following API. https://validator.services.cwrc.ca/ The library seems to be: https://www.npmjs.com/package/@cwrc/leafwriter-validator This time, I tried the above API. ...

RELAX NG and Schematron

Overview When creating TEI/XML with oXygen XML Editor, the following template is generated. <?xml version="1.0" encoding="UTF-8"?> <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?> <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?> <TEI xmlns="http://www.tei-c.org/ns/1.0"> <teiHeader> <fileDesc> <titleStmt> <title>Title</title> </titleStmt> <publicationStmt> Publication Information </publicationStmt> <sourceDesc> Information about the source </sourceDesc> </fileDesc> </teiHeader> <text> <body> Some text here. </body> </text> </TEI> I was curious about the following difference, so I am sharing the results of querying GPT-4. <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?> <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?> Answer The difference between the 2nd and 3rd lines is the namespace specified in the schematypens attribute. Details are explained below. ...

Using the Docker Version of TEI Publisher

Overview I had an opportunity to use the Docker version of TEI Publisher, so here are my notes. https://teipublisher.com/exist/apps/tei-publisher-home/index.html TEI Publisher is described as follows. TEI Publisher facilitates the integration of the TEI Processing Model into exist-db applications. The TEI Processing Model (PM) extends the TEI ODD specification format with a processing model for documents. That way intended processing for all elements can be expressed within the TEI vocabulary itself. It aims at the XML-savvy editor who is familiar with TEI but is not necessarily a developer. ...

Formatting XML Strings in Python

Overview Notes on programs for formatting XML strings in Python. Program 1 I referenced the following. https://hawk-tech-blog.com/python-learn-prettyprint-xml/ I added processing to remove unnecessary blank lines. from xml.dom import minidom import re def prettify(rough_string): reparsed = minidom.parseString(rough_string) pretty = re.sub(r"[\t ]+\n", "", reparsed.toprettyxml(indent="\t")) # Remove unnecessary line breaks after indentation pretty = pretty.replace(">\n\n\t<", ">\n\t<") # Remove unnecessary blank lines pretty = re.sub(r"\n\s*\n", "\n", pretty) # Replace consecutive line breaks (including blank lines) with a single line break return pretty Program 2 I referenced the following. https://qiita.com/hrys1152/items/a87b4ca3c74ec4997f66 When processing TEI/XML, I recommend registering the namespace. ...

How to Convert CMYK Color Images Without Color Inversion

Overview For example, when delivering images via IIIF, performing the following conversion on CMYK color images using ImageMagick would sometimes result in inverted colors. convert source_image.tif -alpha off -define tiff:tile-geometry=256x256 -compress jpeg 'ptif:output_image.tif' Original image (Using an image published on Nuno LAB..) Display example in Image Annotator (created by Masahide Kanzaki) This is not a problem with image servers such as Cantaloupe Image Server or IIPImage, nor with viewers like Image Annotator, Mirador, or Universal Viewer. Rather, the issue lies in the generated tiled TIFF images. ...

Counting Triples in an RDF Store 2: Co-occurrence Frequency

Overview I had the opportunity to count co-occurrence frequencies for RDF triples, so here are my notes. Following the previous article, I will again use the Japan Search RDF store as an example. Example 1 The following query counts the number of triples among sword-type instances that share a common creator (schema:creator). The filter avoids counting identical instances and prevents duplicate counting. select (count(*) as ?count) where { ?entity1 a type:刀剣; schema:creator ?value . ?entity2 a type:刀剣; schema:creator ?value . FILTER(?entity1 != ?entity2 && ?entity1 < ?entity2) } https://jpsearch.go.jp/rdf/sparql/easy/?query=select+(count(*)+as+%3Fcount)+where+{ ++%3Fentity1+a+type%3A刀剣%3B +++++++++++++schema%3Acreator+%3Fvalue+. ++%3Fentity2+a+type%3A刀剣%3B +++++++++++++schema%3Acreator+%3Fvalue+. ++FILTER(%3Fentity1+!%3D+%3Fentity2+%26%26+%3Fentity1+<+%3Fentity2) } ...

Counting the Number of Triples in an RDF Store

Overview Here are my notes on how to count the number of triples in an RDF store. This time, we will use the Japan Search RDF store as an example. https://jpsearch.go.jp/rdf/sparql/easy/ Number of Triples The following query counts the number of triples: SELECT (COUNT(*) AS ?NumberOfTriples) WHERE { ?s ?p ?o . } The result is: https://jpsearch.go.jp/rdf/sparql/easy/?query=SELECT+(COUNT(*)+AS+%3FNumberOfTriples) WHERE+{ ++%3Fs+%3Fp+%3Fo+. } At the time of writing this article (May 6, 2024), there were 1,280,645,565 triples (approximately 1.28 billion). ...

Trying Out TEIGarage

Overview TEIGarage is described as follows. https://github.com/TEIC/TEIGarage/ TEIGarage is a webservice and RESTful service to transform, convert and validate various formats, focussing on the TEI format. TEIGarage is based on the proven OxGarage. Trying It Out You can try it out on the following page. https://teigarage.tei-c.org/ We will use the “TEI Minimal” ODD file published at the following URL. This file is also used as one of the presets in Roma. ...

Handling the Error: Input value "page" contains a non-scalar value

Overview I addressed the same error in the following article. However, there were cases where the error could not be resolved even after applying the above fix, so I describe additional measures here. Error Details The error details are as follows. In particular, it occurred when jsonapi_search_api_facets was enabled. { "jsonapi": { "version": "1.0", "meta": { "links": { "self": { "href": "http://jsonapi.org/format/1.0/" } } } }, "errors": [ { "title": "Bad Request", "status": "400", "detail": "Input value \"page\" contains a non-scalar value.", "links": { "via": { "href": "http://localhost:61117/web/jsonapi/index/document?page%5Blimit%5D=24&sort=field_id" }, "info": { "href": "http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.1" } }, "source": { "file": "/app/vendor/symfony/http-kernel/HttpKernel.php", "line": 83 }, "meta": { "exception": "Symfony\\Component\\HttpFoundation\\Exception\\BadRequestException: Input value \"page\" contains a non-scalar value. in /app/vendor/symfony/http-foundation/InputBag.php:38\nStack trace:\n#0 /app/web/modules/contrib/facets/src/Plugin/facets/url_processor/QueryString.php(92): Symfony\\Component\\HttpFoundation\\InputBag->get('page')\n#1 /app/web/modules/contrib/facets/src/Plugin/facets/processor/UrlProcessorHandler.php(76): Drupal\\facets\\Plugin\\facets\\url_processor\\QueryString->buildUrls(Object(Drupal\\facets\\Entity\\Facet), Array)\n#2 /app/web/modules/contrib/facets/src/FacetManager/DefaultFacetManager.php(339): ... Solution I modified the buildUrls method in the file mentioned above. ...

Bulk Deleting S3 Buckets Using AWS CLI

To list S3 buckets using AWS CLI and delete buckets based on a specific pattern, you can follow the steps below. Here, we explain how to delete buckets whose names start with wby. Prerequisites AWS CLI is installed. Appropriate AWS credentials and access permissions are configured. Step 1: List Buckets First, use the installed AWS CLI to list all S3 buckets: aws s3 ls Step 2: Delete Matching Buckets To delete buckets starting with wby, use a shell script to filter matching buckets and delete them. ...

An Example Analysis of Texts Published in "SAT Daizokyo Text Database 2018"

Overview “SAT Daizokyo Text Database 2018” is described as follows. https://21dzk.l.u-tokyo.ac.jp/SAT2018/master30.php This site is the 2018 version of the digital research environment provided by the SAT Daizokyo Text Database Research Society. Since April 2008, the SAT Daizokyo Text Database Research Society has provided a full-text search service for all 85 volumes of the text portion of the Taisho Shinshu Daizokyo, while enhancing usability through collaboration with various web services and exploring the possibilities of web-based humanities research environments. In SAT2018, we have incorporated new services including collaboration with high-resolution images via IIIF using recently spreading machine learning technology, publication of modern Japanese translations understandable by high school students with linkage to the original text. We have also updated the Chinese characters in the main text to Unicode 10.0 and integrated most functions of the previously published SAT Taisho Image Database. However, this release also provides a framework for collaboration, and going forward, data will be expanded along these lines to further enhance usability. The web services provided by our research society rely on services and support from various stakeholders. For the new services in SAT2018, we received support from the Institute for Research in Humanities regarding machine learning and IIIF integration, and from the Japan Buddhist Federation and Buddhist researchers nationwide for creating modern Japanese translations. We hope that SAT2018 will be useful not only for Buddhist researchers but also for various people interested in Buddhist texts. Furthermore, we would be delighted if the approach to applying technology to cultural materials presented here serves as a model for humanities research. ...

Parsing XML Strings in Node.js

Overview To parse XML strings and extract information from them in Node.js, I recommend using the xmldom library. This allows you to work with XML in a way similar to how you manipulate the DOM in a browser. Below is how to set up a function to parse XML and extract elements, focusing on “PAGE” tags, using xmldom. Install the xmldom library: First, install xmldom, which is needed to parse XML strings. npm install xmldom Use xmldom to parse XML and extract the required elements. const { DOMParser } = require('xmldom'); const xmlString = "..."; // DOMParserを使用してXML文字列を解析 const parser = new DOMParser(); const xmlDoc = parser.parseFromString(xmlString, 'text/xml'); // 全てのPAGE要素を取得 const pages = xmlDoc.getElementsByTagName('PAGE'); // 発見されたPAGE要素の数をログに記録（例） console.log('PAGE要素の数:', pages.length); In this example, the basic function logs the XML string, parses it into a document, iterates over each “PAGE” element, and logs its attributes and content. The processing within the loop can be customized based on specific requirements, such as extracting particular details from each page. ...

LlamaIndex+GPT4+gradio

Overview I had the opportunity to use LlamaIndex, GPT4, and gradio together, so this is a memo of the process. Since the text used was small in size, the results are accordingly modest, but I prototyped a chatbot for Shibusawa Eiichi. Background I referred to the following article. https://qiita.com/DeepTama/items/1a44ddf6325c2b2cd030 Based on the above, I made modifications to work with libraries as of April 20, 2024. The notebook is published at the following location. ...