Converting TEI/XML Files to EPUB Using Python

Overview I had the opportunity to convert TEI/XML files to EPUB using Python, so here are my notes. While Oxygen XML Editor is one method for converting TEI/XML files to EPUB, this time I used the Python library “EbookLib.” I referenced the following article. https://dev.classmethod.jp/articles/try-create-epub-by-python-ebooklib/ In particular, this time the goal is to create a vertical-text EPUB from the TEI/XML files published in the “Koui Genji Monogatari Text Data Repository.” ...

September 30, 2022 · 1 min · Nakamura

How to Extract and Process Only Text Strings from XML Files

I had the opportunity to extract and process only text strings from XML files. For this need, I was able to achieve it with the following script. s e o l u e p m e = n t B s e a = u t s i o f u u p l . S f o i u n p d ( C o h p i e l n d ( r p e a n t ( h t , e ' x r t ' = ) T , r u " e x , m l r " e ) c u r s i v e = T r u e ) The key point is passing text=True, which allows you to retrieve only text nodes. ...

September 22, 2022 · 1 min · Nakamura

How to Set the xml:id Attribute with BeautifulSoup

This is a memo on how to set the xml:id attribute with BeautifulSoup. The following method causes an error. f s s p r o o r o u u i m p p n . t b = a ( s p s 4 B p o e e u i a n p m u d ) p t ( o i s r f o t u u l p B S . e o n a u e u p w t ( _ i f t f e a u a g l t ( S u " o r p u e " p s , = " a x b m c l = " " ) x y z " , x m l : i d = " a b c " ) ) Writing it as follows works correctly. ...

August 30, 2022 · 2 min · Nakamura

I Created a Program to Extract Differences Between Two Texts

Overview I created a program to extract differences between two texts. You can use it from the following Google Colab notebook. https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/校異情報の生成.ipynb A well-known service for this purpose is “difff”, but this time I implemented it using Python. https://difff.jp/ For calculating the differences between texts, I used difflib.SequenceMatcher. https://docs.python.org/ja/3/library/difflib.html Usage You can choose between two output formats: HTML files and TEI files. HTML Here is an example of the HTML file output. ...

July 14, 2022 · 5 min · Nakamura

Created a Sample Repository for Running XSLT in Node.js

I created a sample repository for running XSLT in Node.js. https://github.com/ldasjp8/nodejs-xslt We hope this is helpful when processing TEI/XML files and similar in Node.js.

April 8, 2022 · 1 min · Nakamura