Python

Adding Content to Drupal Using Python

Overview I had an opportunity to add content to Drupal using Python, so this is a memo of the process. I referenced the following article. https://weimingchenzero.medium.com/use-python-to-call-drupal-9-core-restful-api-to-create-new-content-9f3fa8628ab4 Preparing Drupal I set it up on Amazon Lightsail. The following article is a useful reference. https://annai.co.jp/article/use-aws-lightsail Modules Install the following modules: HTTP Basic Auth, JSON:API, RESTful Web Services, and Serialization. Changing JSON:API Settings Access the following page to change the settings. /admin/config/services/jsonapi Python Set {IP address or domain name} and {password} as appropriate. ...
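As a rough illustration of the kind of request involved, the following sketch builds (but does not send) a JSON:API request that would create a node with HTTP Basic Auth. It is a minimal sketch under assumptions, not the code from the article: the content type "article", the /jsonapi/node/article path, the plain_text body format, the plain-HTTP URL, the sample credentials, and the helper name build_create_request are all chosen for illustration. It uses only the standard library.

```python
import base64
import json
import urllib.request


def build_create_request(host, user, password, title, body):
    """Build a POST request that would create an article node via JSON:API.

    Assumptions (not from the article): the content type is "article" and
    the site is reachable over plain HTTP at the given host.
    """
    payload = {
        "data": {
            "type": "node--article",
            "attributes": {
                "title": title,
                "body": {"value": body, "format": "plain_text"},
            },
        }
    }
    # HTTP Basic Auth: base64 of "user:password"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"http://{host}/jsonapi/node/article",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/vnd.api+json",
            "Accept": "application/vnd.api+json",
            "Authorization": f"Basic {token}",
        },
    )


# Hypothetical values; replace with your own host and credentials.
request = build_create_request(
    "example.com", "admin", "secret", "Hello", "Created from Python"
)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and payload before touching a live site.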

Creating RDF from Excel

Overview To make RDF data easier to create, I prototyped a Python library that converts data entered in Excel into RDF. It is still a work in progress, but here are my notes. Notebook You can try it from the following notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/ExcelからRDFデータを作成する.ipynb Source Excel Data Create an Excel file like the following. https://docs.google.com/spreadsheets/d/16SufG69_aZP0u0Kez8bisImGvVb4-z990AEPesdVxLo/edit#gid=0 In the above example, the prefixes used are listed in a sheet named “prefix,” and the actual data is entered in a sheet named “target.” Following the conventions of Omeka S’s Bulk Import, language labels such as “@ja” and types such as “^^uri” can be specified. ...
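To give a feel for how such markers might be interpreted, here is a minimal sketch that splits a column header into a property, an optional language label, and an optional type. The exact header layout used by the library is not shown in this excerpt, so the layout assumed here (a prefixed property name followed by an optional "@lang" or "^^type" suffix), as well as the names HEADER_PATTERN and parse_header, are hypothetical.

```python
import re

# Assumed header layout: "<prefix:name>" optionally followed by
# "@<language>" or "^^<type>", e.g. "rdfs:label @ja" or "schema:url ^^uri".
HEADER_PATTERN = re.compile(
    r"^\s*(?P<prop>[^\s@^]+)\s*(?:\^\^(?P<dtype>\S+)|@(?P<lang>\S+))?\s*$"
)


def parse_header(header):
    """Split a column header into property, datatype, and language parts."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        raise ValueError(f"Unrecognized header: {header!r}")
    return {
        "property": match["prop"],
        "datatype": match["dtype"],
        "language": match["lang"],
    }


label = parse_header("rdfs:label @ja")
link = parse_header("schema:url ^^uri")
plain = parse_header("dcterms:title")
```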

How to Extract respStmt name Values from TEI/XML Files (Explained by GPT-4)

How to Extract respStmt name Values from TEI/XML Files: Approaches Using BeautifulSoup and ElementTree in Python This article introduces how to extract respStmt name values from TEI/XML files using Python’s BeautifulSoup and ElementTree. Method 1: Using ElementTree First, we extract the respStmt name value using Python’s standard library xml.etree.ElementTree.
    import xml.etree.ElementTree as ET
    # Load the XML file
    tree = ET.parse('your_file.xml')
    root = tree.getroot()
    # Define the namespace
    ns = {'tei': 'http://www.tei-c.org/ns/1.0'}
    # Extract the respStmt name value
    name = root.find('.//tei:respStmt/tei:name', ns)
    # Display the name text
    if name is not None:
        print(name.text)
    else:
        print("The name tag was not found.")
Method 2: Using BeautifulSoup Next, we extract the respStmt name value using BeautifulSoup. First, make sure the beautifulsoup4 and lxml libraries are installed. If they are not installed, you can install them with the following command. ...
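When a file contains more than one respStmt, a small variation on the ElementTree approach can collect every name. The following is a sketch only: the inline sample document, the personal names in it, and the helper name extract_resp_names are made up for illustration, and it parses a string rather than a file so that it runs on its own.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, used only so the example is self-contained.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Sample</title>
    <respStmt><resp>Encoding</resp><name>Taro Yamada</name></respStmt>
    <respStmt><resp>Proofreading</resp><name>Hanako Suzuki</name></respStmt>
  </titleStmt></fileDesc></teiHeader>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_resp_names(xml_text):
    """Return the text of every tei:name directly inside a tei:respStmt."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//tei:respStmt/tei:name", NS)]


names = extract_resp_names(SAMPLE_TEI)
print(names)
```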

Memo on Using nbdev

Overview When creating Python packages, I use nbdev. https://nbdev.fast.ai/ nbdev is described as follows: Write, test, document, and distribute software packages and technical articles — all in one place, your notebook. This article serves as a memo when using nbdev. Installation The following tutorial page is a helpful reference. https://nbdev.fast.ai/tutorials/tutorial.html Below is a brief overview of the workflow. After installing the related tools, create a GitHub repository, clone it, and then execute the following in the cloned directory. ...

Publishing Images Using IIIF Image API Level 0

Overview IIIF Image API level 0 delivers images using pre-generated static tile images. This enables image publishing using only static file hosting services such as GitHub Pages or Amazon S3. However, it has the drawback of not being able to extract arbitrary regions of images. This article introduces an example of publishing images using IIIF Image API level 0. Tool You can try it with the following notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/IIIF_Image_API_静的ファイル作成ツール.ipynb This notebook is based on the following script. ...
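To illustrate what "pre-generated static tiles" means in practice, the sketch below enumerates the tile regions and output sizes that a static generator would typically need to write for one image. It is a simplified illustration and not the notebook's actual script: the function name static_tile_requests, the default tile size of 512, and the scale factors are assumptions, and the extra full-image sizes that viewers may also request are omitted.

```python
import math


def static_tile_requests(width, height, tile_size=512, scale_factors=(1, 2, 4)):
    """List (region, size) pairs for every tile at every scale factor.

    region is (x, y, w, h) in full-resolution pixels; size is the
    (width, height) of the tile image that would be written to disk.
    """
    requests = []
    for scale in scale_factors:
        step = tile_size * scale  # full-resolution pixels covered by one tile
        for y in range(0, height, step):
            for x in range(0, width, step):
                w = min(step, width - x)   # clip tiles at the right edge
                h = min(step, height - y)  # clip tiles at the bottom edge
                size = (math.ceil(w / scale), math.ceil(h / scale))
                requests.append(((x, y, w, h), size))
    return requests


# Hypothetical 1000x800 image with two scale factors.
tiles = static_tile_requests(1000, 800, tile_size=512, scale_factors=(1, 2))
```

Each pair corresponds to one file under a path of the form {region}/{size}/0/default.jpg; how the size segment is spelled differs between Image API versions, so it is left as a tuple here. Because only these exact regions exist as files, any other region request cannot be served, which is the drawback mentioned above.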

NDL Classical Text OCR Using Google Colab

Overview I created an NDL “Classical Text” OCR application using Google Colab. You can try it at the following URL. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/NDL古典籍OCRの実行例.ipynb NDL Classical Text OCR itself is described in the following repository. https://github.com/ndl-lab/ndlkotenocr_cli The notebook was created with reference to @blue0620’s notebook. Thank you! https://twitter.com/blue0620/status/1617888733323485184 In the notebook I created, I added support for additional input formats and a feature to save results to Google Drive. How to Use The usage is almost the same as the NDLOCR application. Please refer to the following video. ...

Validating XML Files Using the JPCOAR Schema

Overview JPCOAR Schema publishes XML Schema Definitions in the following repository. Thank you for creating the schema and making the data available. https://github.com/JPCOAR/schema This article is a memo of trying XML file validation using the above schema. (Since this is my first time doing this kind of validation, it may contain inaccurate terminology or information. I apologize.) A Google Colab notebook is also prepared. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/JPCOARスキーマを用いたxmlファイルのバリデーション.ipynb Preparation Clone the repository
    cd /content/
    git clone https://github.com/JPCOAR/schema.git
Install the library ...

Trying the jingtrang Library for RELAX NG Schema: Creating RNG Files

Overview In the following article, I performed XML file validation using jingtrang and RNG files. Since this jingtrang library can create RNG files from XML files, I decided to try it out. I also prepared a Google Colab notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/jingtrangを試す：作成編.ipynb Creating an RNG File As the source file for creating the RNG file, I prepared the following:
    <root>
      <title>aaa</title>
    </root>
For the above file, execute the following: ...

Trying the jingtrang Library for RELAX NG Schema: Validation

Overview I had an opportunity to create an XML file conforming to a specific schema, and needed to verify that the XML file matched the schema. To meet this requirement, I tried the jingtrang library for working with RELAX NG schemas, so here are my notes: https://pypi.org/project/jingtrang/ I also prepared a Google Colab notebook: https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/jingtrangを試す.ipynb Trying Validation
    # Install the library
    pip install jingtrang
    # Download the RNG file (using tei_all)
    wget https://raw.githubusercontent.com/nakamura196/test2021/main/tei_all.rng
    # Prepare the XML file to validate (download the Koui Genji Monogatari text)
    wget https://kouigenjimonogatari.github.io/tei/01.xml
Passing Example Running the following produced no output: ...

Converting Word to TEI/XML

Overview I had an opportunity to convert Word files to TEI/XML files. Upon investigation, in addition to official TEI tools such as TEIGarage Conversion, I found a conversion example in TEI Publisher: https://teipublisher.com/exist/apps/tei-publisher/test/test.docx.xml The above example appeared to convert Word style information into TEI tags, so I tried this approach. For this project, I used the python-docx library with the goal of using it independently of TEI Publisher. Word File I created a prototype Word file like the one below. All styles are provisional, but I created styles such as “tei:persName” and “tei:warichu” and changed their visual styling such as color. The mechanism works by applying styles to perform simple structuring. ...

Running Tesseract on Google Colab (with Japanese Support)

I created a notebook for running Tesseract on Google Colab. It also supports Japanese. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/Tesseractを試す.ipynb At the end, I also introduce a flow for converting hOCR files to ALTO-format XML files. Specifically, the following tool is used: https://digi.bib.uni-mannheim.de/ocr-fileformat/ I hope this serves as a useful reference.

Workaround for HuggingFace Trainer() Not Starting When Using Vertex AI Workbench

I encountered an issue where HuggingFace’s Trainer() would not start when using Google Cloud’s Vertex AI Workbench. A similar bug was reported on the following page: https://stackoverflow.com/questions/73415068/huggingface-trainer-does-nothing-only-on-vertex-ai-workbench-works-on-colab Initially, I had selected the “PyTorch” environment as shown below, and this is where the bug occurred: As described in the article above, switching to the “Python” environment resolved the issue: Note that when using this environment, you first need to run the following: ...

Trying the ResourceSync Python Library

Overview This is a memo from trying out “py-resourcesync,” a Python library for ResourceSync. https://github.com/resourcesync/py-resourcesync Setup
    git clone https://github.com/resourcesync/py-resourcesync
    cd py-resourcesync
    python setup.py install
Execution resourcelist First, create the output resource_dir directory. An ex_resource_dir folder will be created in the current directory. ...

A Python Package for Interacting with the Omeka S REST API

Overview A package has been developed that allows you to operate the Omeka S REST API from Python. https://github.com/wragge/omeka_s_tools Furthermore, based on the above repository, I have created a repository with several additional features. https://github.com/nakamura196/omeka_s_tools2 In this article, I will introduce this repository. Usage Please refer to the following page. https://nakamura196.github.io/omeka_s_tools2/ This repository was developed using nbdev, which allows package development and documentation to proceed in parallel, and I found it to be a very convenient system. ...

Double-Sided Ruby Annotations Using python-docx

This is a memo on how to achieve double-sided ruby (furigana) in Word using python-docx. You can try it from the following notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/python_docxを用いた両側ルビ.ipynb An output example is shown below. An input example is shown below.
    <body>
    <p>
     私は
     <ruby>
      <rb>
       <ruby>
        <rb>打</rb>
        <rt place="right">ダ</rt>
       </ruby>
       <ruby>
        <rb>球</rb>
        <rt place="right">キウ</rt>
       </ruby>
       場
      </rb>
      <rt place="left">ビリヤード</rt>
     </ruby>
     に行きました。
    </p>
    <p>
     <ruby>
      <rb>入学試験</rb>
      <rt place="above">にゅうがくしけん</rt>
     </ruby>
     があります。
    </p>
    </body>
The program is still incomplete, but I hope it serves as a helpful reference. ...

Converting TEI/XML Files to EPUB Using Python

Overview I had the opportunity to convert TEI/XML files to EPUB using Python, so here are my notes. While Oxygen XML Editor is one method for converting TEI/XML files to EPUB, this time I used the Python library “EbookLib.” I referenced the following article. https://dev.classmethod.jp/articles/try-create-epub-by-python-ebooklib/ In particular, this time the goal is to create a vertical-text EPUB from the TEI/XML files published in the “Koui Genji Monogatari Text Data Repository.” ...

How to Extract and Process Only Text Strings from XML Files

I had the opportunity to extract and process only text strings from XML files. The following script does this.
    from bs4 import BeautifulSoup
    # path: path to the XML file to read
    soup = BeautifulSoup(open(path,'r'), "xml")
    elements = soup.findChildren(text=True, recursive=True)
The key point is passing text=True, which allows you to retrieve only text nodes. ...
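For comparison, a similar result can be had with only the standard library. This is a different technique from the BeautifulSoup script above, not the article's method: it uses xml.etree.ElementTree and itertext(), and the sample XML and the helper name text_strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def text_strings(xml_text):
    """Return the non-empty text strings in document order, whitespace-trimmed."""
    root = ET.fromstring(xml_text)
    # itertext() walks the tree and yields each element's text and tail in order.
    return [text.strip() for text in root.itertext() if text.strip()]


# Made-up sample input.
sample = "<root><title>aaa</title><body>bbb <hi>ccc</hi> ddd</body></root>"
strings = text_strings(sample)
print(strings)
```

Unlike the BeautifulSoup version, this returns plain strings rather than node objects, so it suits read-only extraction more than in-place editing of the text nodes.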

How to Set the xml:id Attribute with BeautifulSoup

This is a memo on how to set the xml:id attribute with BeautifulSoup. The following method causes an error.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(features="xml")
    soup.append(soup.new_tag("p", abc="xyz", xml:id="abc"))
    print(soup)
Writing it as follows works correctly. ...

How to Register and Delete RDF Files in Virtuoso RDF Store Using curl and Python

Overview Notes on how to register and delete RDF files in a Virtuoso RDF store using curl and Python. The following was used as a reference. https://vos.openlinksw.com/owiki/wiki/VOS/VirtRDFInsert#HTTP PUT curl As described on the above page, first create myfoaf.rdf as sample data for registration.
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:foaf="http://xmlns.com/foaf/0.1/">
        <foaf:Person rdf:about="http://www.example.com/people/中村覚">
            <foaf:name>中村覚</foaf:name>
        </foaf:Person>
    </rdf:RDF>
Next, execute the following command. ...

I Created a Program to Extract Differences Between Two Texts

Overview I created a program to extract differences between two texts. You can use it from the following Google Colab notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/校異情報の生成.ipynb A well-known service for this purpose is “difff”, but this time I implemented it using Python. https://difff.jp/ For calculating the differences between texts, I used difflib.SequenceMatcher. https://docs.python.org/ja/3/library/difflib.html Usage You can choose between two output formats: HTML files and TEI files. HTML Here is an example of the HTML file output. ...
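The core of a difflib.SequenceMatcher-based comparison can be sketched in a few lines. This is a minimal illustration and not the notebook's actual code: the helper name extract_differences and the sample strings are made up, and the HTML and TEI output steps are left out.

```python
from difflib import SequenceMatcher


def extract_differences(a, b):
    """Return (operation, text_in_a, text_in_b) for every non-matching span."""
    # autojunk=False disables the heuristic that ignores very frequent
    # characters in long inputs, which is undesirable for text collation.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is "replace", "delete", or "insert"
            diffs.append((tag, a[i1:i2], b[j1:j2]))
    return diffs


# Made-up sample: the two strings differ in a single character.
diffs = extract_differences("いづれの御時にか", "いずれの御時にか")
print(diffs)  # [('replace', 'づ', 'ず')]
```

Each tuple carries enough information to mark up a variant reading, for example by wrapping the two spans in highlighted HTML or in a TEI apparatus entry.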