Tech | Digital Archive System Technical Blog

Launching Jupyter Lab on mdx

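Overview

I had occasion to launch Jupyter Lab on mdx, so this is a memo for future reference. For setting up mdx itself, please also refer to the following article.

Reference

The following video was very helpful.

https://youtu.be/-KJwtctadOI?si=xaKajk79b1MxTpJ6

Setup

On the server

Install pip.

    sudo apt install python3-pip

Add ~/.local/bin to PATH: open ~/.bashrc, append the export line, then reload the file.

    nano ~/.bashrc
    export PATH="$HOME/.local/bin:$PATH"
    source ~/.bashrc

Jupyter Lab then starts with the following command (this assumes Jupyter Lab itself is already installed, for example with pip install jupyterlab).

    jupyter-lab

On the local machine

Open an SSH tunnel as follows.

    ssh -N -L 8888:localhost:8888 mdxuser@xxx.yyy.zzz.lll -i ~/.ssh/mdx/id_rsa

Then open the address shown in the server console.

    http://localhost:8888/lab?token=xxx

With this, Jupyter Lab became usable from the local browser.

Reference: file transfer

Files can be transferred from the local machine to the server with a command such as the following.

    scp -i ~/.ssh/mdx/id_rsa /path/to/local/image.jpg username@remote_address:/path/to/remote/directory

Summary

I hope this is helpful.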

Fixing an inference app built with Hugging Face Spaces and a YOLOv5 model (trained on the NDL-DocL dataset)

Overview

In an earlier article I introduced an inference app built with Hugging Face Spaces and the YOLOv5 model (trained on the NDL-DocL dataset) described in another article. The app had stopped working, so I fixed it.

https://huggingface.co/spaces/nakamura196/yolov5-ndl-layout

This is a memo of the changes made in that fix.

Changes

The revised app.py is as follows.

    import gradio as gr
    from PIL import Image
    import yolov5
    import json

    model = yolov5.load("nakamura196/yolov5-ndl-layout")

    def yolo(im):
        results = model(im)  # inference
        df = results.pandas().xyxy[0].to_json(orient="records")
        res = json.loads(df)
        im_with_boxes = results.render()[0]  # results.render() returns a list of images
        # Convert the numpy array back to an image
        output_image = Image.fromarray(im_with_boxes)
        return [
            output_image,
            res
        ]

    inputs = gr.Image(type='pil', label="Original Image")
    outputs = [
        gr.Image(type="pil", label="Output Image"),
        gr.JSON()
    ]

    title = "YOLOv5 NDL-DocL Datasets"
    description = "YOLOv5 NDL-DocL Datasets Gradio demo for object detection. Upload an image or click an example image to use."
    article = "<p style='text-align: center'>YOLOv5 NDL-DocL Datasets is an object detection model trained on the <a href=\"https://github.com/ndl-lab/layout-dataset\">NDL-DocL Datasets</a>.</p>"

    examples = [
        ['『源氏物語』(東京大学総合図書館所蔵).jpg'],
        ['『源氏物語』(京都大学所蔵).jpg'],
        ['『平家物語』(国文学研究資料館提供).jpg']
    ]

    demo = gr.Interface(yolo, inputs, outputs, title=title, description=description, article=article, examples=examples)
    demo.launch(share=False)

First, following the Gradio version upgrade, gr.inputs.Image and similar calls were changed to gr.Image and so on. ...

ultralyticsplus: handling "ValueError: Invalid CUDA 'device=0' requested..."

Overview

An inference app using YOLOv8 is published at the following URL.

https://huggingface.co/spaces/nakamura196/yolov8-ndl-layout

Initially, the following error occurred.

    ValueError: Invalid CUDA 'device=0' requested. Use 'device=cpu' or pass valid CUDA device(s) if available, i.e. 'device=0' or 'device=0,1,2,3' for Multi-GPU.
    torch.cuda.is_available(): False
    torch.cuda.device_count(): 0
    os.environ['CUDA_VISIBLE_DEVICES']: None
    See https://pytorch.org/get-started/locally/ for up-to-date torch install instructions if no CUDA devices are seen by torch.

The error was resolved by adding the device argument as follows.

    results = model.predict(img, device="cpu")

Details

The app uses the following library.

https://github.com/fcakyon/ultralyticsplus

Using it as shown below produced the error above.

    from ultralyticsplus import YOLO, render_result

    # load model
    model = YOLO("nakamura196/yolov8-ndl-layout")

    img = 'https://dl.ndl.go.jp/api/iiif/2534020/T0000001/full/full/0/default.jpg'
    results = model.predict(img)

Adding the argument as follows resolved the error.

    results = model.predict(img, device="cpu")

Note

When using a model stored locally, as shown below, the error did not occur even without device="cpu". ...
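Rather than hard-coding device="cpu", the device could be chosen at runtime so the same code also works on a GPU machine. The following is a minimal sketch under that assumption; pick_device is a hypothetical helper name, not part of ultralyticsplus, and it falls back to the CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "0" when PyTorch sees a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # optional here; if it is missing we simply use the CPU
    except ImportError:
        return "cpu"
    return "0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
# Intended use (not executed in this sketch):
#   results = model.predict(img, device=device)
```

On a CPU-only environment such as a free Hugging Face Space, this resolves to "cpu", which matches the fix described above.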

Prototyping entity-lookup with the Japan Search utilization schema (利活用スキーマ)

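Overview

This is a follow-up to the previous article. Using the Japan Search utilization schema, I prototyped a package that performs entity lookup in the style of cwrc's entity-lookup libraries.

Demo

You can try it on the following page.

https://nakamura196.github.io/nuxt3-demo/entity-lookup/

For each entity type, such as Person, Place, and Organization, it looks up entities in JPS, Wikidata, and VIAF.

Library

The library is published here.

https://github.com/nakamura196/jps-entity-lookup

It is based on the repository already published by cwrc, https://github.com/cwrc/wikidata-entity-lookup, with mainly the following file modified to match the Japan Search utilization schema.

https://github.com/nakamura196/jps-entity-lookup/blob/main/src/index.js

Installation

The following page was helpful.

https://qiita.com/pure-adachi/items/ba82b03dba3ebabc6312

During development

To install the library while it is under development, I installed it from a local path as follows.

    pnpm i /Users/nakamura/xxx/jps-entity-lookup

From GitHub

To install from GitHub, run the following.

    pnpm i nakamura196/jps-entity-lookup

Summary

I hope this is helpful.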

Trying cwrc's wikidata-entity-lookup

Overview

This is a follow-up to the previous article. One of the features listed for LEAF-WRITER is the following.

    the ability to look up and select identifiers for named entity tags (persons, organizations, places, or titles) from the following Linked Open Data authorities: DBPedia, Geonames, Getty, LGPN, VIAF, and Wikidata.

This feature is implemented with libraries such as the following.

https://github.com/cwrc/wikidata-entity-lookup

This article tries out that functionality.

Usage

npm packages are published at locations such as the following.

https://www.npmjs.com/search?q=cwrc

It does not appear in the list above, but this article targets the following package.

https://www.npmjs.com/package/wikidata-entity-lookup

Install it as follows.

    npm i wikidata-entity-lookup

wikidataLookup.findPerson could be called as follows.

    <script lang="ts" setup>
    // @ts-ignore
    import wikidataLookup from "wikidata-entity-lookup";

    interface Entity {
      id: string;
      name: string;
      description: string;
      uri: string;
    }

    // ref is auto-imported by Nuxt
    const query = ref<string>("");
    const results = ref<Entity[]>([]);

    const search = () => {
      wikidataLookup.findPerson(query.value).then((result: Entity[]) => {
        results.value = result;
      });
    };
    </script>

Demo

I prepared an example implementation in Nuxt. ...

Trying the CWRC XML Validator API

Overview

LEAF-WRITER is one of the editors available for TEI/XML.

https://leaf-writer.leaf-vre.org/

It is described as follows.

    The XML & RDF online editor of the Linked Editing Academic Framework

The GitLab repository is here.

https://gitlab.com/calincs/cwrc/leaf-writer/leaf-writer

One of the tool's features is described as follows.

    continuous XML validation

The following API appears to be used for this validation.

https://validator.services.cwrc.ca/

The corresponding library appears to be the following.

https://www.npmjs.com/package/@cwrc/leafwriter-validator

This article tries out the API above.

Trying it

Opening the URL displays the API page.

https://validator.services.cwrc.ca/

Under "Try It", I tested POST with the following values.

Schema URL: https://raw.githubusercontent.com/nakamura196/test2021/main/tei_excel.rng
Schema Type: RNG_XML
Document Content:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
    <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title>Title</title>
          </titleStmt>
          <publicationStmt>
            <p>Publication Information</p>
          </publicationStmt>
          <sourceDesc>
            <p>Information about the source</p>
          </sourceDesc>
        </fileDesc>
      </teiHeader>
      <text>
        <body>
          <p> I am <name>Satoru Nakamura</name>. </p>
        </body>
      </text>
    </TEI>

This document produces no errors when http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng is used as the schema, but it should produce errors with https://raw.githubusercontent.com/nakamura196/test2021/main/tei_excel.rng, which restricts the set of allowed tags. ...

RELAX NGとSchematron

Overview

Creating a TEI/XML file in oXygen XML Editor generates a template like the following.

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
    <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title>Title</title>
          </titleStmt>
          <publicationStmt>
            <p>Publication Information</p>
          </publicationStmt>
          <sourceDesc>
            <p>Information about the source</p>
          </sourceDesc>
        </fileDesc>
      </teiHeader>
      <text>
        <body>
          <p>Some text here.</p>
        </body>
      </text>
    </TEI>

I was curious about the difference between the following two lines, so I am sharing the answer I got from GPT-4.

    <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
    <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>

Answer

The difference between line 2 and line 3 is the namespace given in the schematypens attribute. Details follow.

Line 2

    <?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>

href : http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng
type : application/xml
schematypens : http://relaxng.org/ns/structure/1.0

This line specifies a RELAX NG schema for the TEI (Text Encoding Initiative) document. RELAX NG is a schema language for validating the structure of XML documents. ...
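To make the distinction concrete, the following minimal sketch reads the xml-model processing instructions from a document and reports which schema language each one declares, based only on its schematypens value. The function name list_schema_languages and the shortened sample document are illustrative assumptions; only the two namespace URIs come from the template above.

```python
import re
from xml.dom import minidom

# Namespace URI in schematypens -> schema language it announces
SCHEMA_LANGUAGES = {
    "http://relaxng.org/ns/structure/1.0": "RELAX NG",
    "http://purl.oclc.org/dsdl/schematron": "Schematron",
}

TEI_RNG = "http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng"

# Shortened sample: the two xml-model lines from the template plus an empty root
SAMPLE = (
    f'<?xml-model href="{TEI_RNG}" type="application/xml" '
    'schematypens="http://relaxng.org/ns/structure/1.0"?>\n'
    f'<?xml-model href="{TEI_RNG}" type="application/xml" '
    'schematypens="http://purl.oclc.org/dsdl/schematron"?>\n'
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"/>'
)


def list_schema_languages(xml_text):
    """Return the schema language declared by each xml-model instruction."""
    doc = minidom.parseString(xml_text)
    found = []
    for node in doc.childNodes:
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE and node.target == "xml-model":
            # The instruction body is a list of pseudo-attributes: name="value"
            attrs = dict(re.findall(r'(\w+)="([^"]*)"', node.data))
            found.append(SCHEMA_LANGUAGES.get(attrs.get("schematypens"), "unknown"))
    return found


print(list_schema_languages(SAMPLE))
```

Both instructions point at the same href; only schematypens tells a validating editor whether to apply RELAX NG rules or the Schematron rules embedded in that file.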

Using the Docker version of TEI Publisher

Overview

I had occasion to use the Docker version of TEI Publisher, so this is a memo for future reference.

https://teipublisher.com/exist/apps/tei-publisher-home/index.html

TEI Publisher is described as follows.

    TEI Publisher facilitates the integration of the TEI Processing Model into exist-db applications. The TEI Processing Model (PM) extends the TEI ODD specification format with a processing model for documents. That way intended processing for all elements can be expressed within the TEI vocabulary itself. It aims at the XML-savvy editor who is familiar with TEI but is not necessarily a developer.

...

Pretty-printing an XML string in Python

Overview

This is a memo of programs for pretty-printing an XML string in Python.

Program 1

Based on the following article, with added steps such as removing unnecessary blank lines.

https://hawk-tech-blog.com/python-learn-prettyprint-xml/

    from xml.dom import minidom
    import re

    def prettify(rough_string):
        reparsed = minidom.parseString(rough_string)
        # Remove the unnecessary line breaks left after indentation
        pretty = re.sub(r"[\t ]+\n", "", reparsed.toprettyxml(indent="\t"))
        # Remove unnecessary blank lines
        pretty = pretty.replace(">\n\n\t<", ">\n\t<")
        # Replace consecutive line breaks (including whitespace-only lines) with a single one
        pretty = re.sub(r"\n\s*\n", "\n", pretty)
        return pretty

Program 2

Based on the following article.

https://qiita.com/hrys1152/items/a87b4ca3c74ec4997f66

When processing TEI/XML, registering the namespace is recommended.

    import xml.etree.ElementTree as ET

    # Register the namespace
    ET.register_namespace('', "http://www.tei-c.org/ns/1.0")

    # xml_string holds the XML text to format
    tree = ET.ElementTree(ET.fromstring(xml_string))
    ET.indent(tree, space=' ')
    tree.write('output.xml', encoding='UTF-8', xml_declaration=True)

Summary

I hope this is helpful.
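The reason for registering the namespace can be seen by serializing the same TEI fragment before and after the call. This is a small sketch; the sample string and the helper name serialize are assumptions for illustration, and ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
xml_string = (
    f'<TEI xmlns="{TEI_NS}"><text><body>'
    "<p>Some text here.</p></body></text></TEI>"
)


def serialize(xml_text):
    """Parse, indent, and return the XML as a string."""
    root = ET.fromstring(xml_text)
    ET.indent(root, space="  ")
    return ET.tostring(root, encoding="unicode")


# Without registration, ElementTree invents a prefix such as ns0
without_registration = serialize(xml_string)

# register_namespace changes global state, so it is called after the first run
ET.register_namespace("", TEI_NS)
with_registration = serialize(xml_string)

print(without_registration)
print(with_registration)
```

After registration the output keeps the default namespace declaration on the TEI element instead of prefixing every element, which is usually what TEI tools and human readers expect.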

Running convert on CMYK images without inverting the colors

Overview

When delivering images via IIIF, for example, running the following ImageMagick conversion on a CMYK image sometimes produced an image with inverted colors.

    convert source_image.tif -alpha off -define tiff:tile-geometry=256x256 -compress jpeg 'ptif:output_image.tif'

Original image (an image published by 布LAB. is used here.)

Display example in Image Annotator (created by Masahide Kanzaki)

The problem does not appear to lie with image servers such as Cantaloupe Image Server or IIPImage, nor with viewers such as Image Annotator, Mirador, or Universal Viewer, but with the generated tiled TIFF itself. This article explains how to deal with it.

Background

Similar problems have been reported in several places, including the following article.

https://scrapbox.io/giraffate/ImageMagickでCMYKのJPG画像を合成したら色が反転するバグ

For the solution, I referred to the following thread, which suggests adding -colorspace sRGB.

https://www.imagemagick.org/discourse-server/viewtopic.php?t=32585

Conversion

The command for creating tiled TIFFs follows this page.

https://samvera.github.io/serverless-iiif/docs/source-images#using-imagemagick

Specifically, it is the following.

    convert source_image.tif -alpha off -define tiff:tile-geometry=256x256 -compress jpeg 'ptif:output_image.tif'

Running it unchanged on a CMYK image produced the inverted image shown at the beginning. Cantaloupe Image Server was used as the image server, but the same behavior was observed with IIPImage and others.

Revised conversion command

Add -colorspace sRGB as follows.

    convert source_image.tif -alpha off -colorspace sRGB -define tiff:tile-geometry=256x256 -compress jpeg 'ptif:output_image.tif'

As a result, the image is displayed without inverted colors in IIIF-compatible viewers such as Image Annotator.

Reference

To check how an image is displayed, Mirador and Universal Viewer generally take the URL of an IIIF manifest file, whereas Image Annotator accepts the URI of an image. ...
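When many images need converting, the revised command can be assembled programmatically. The sketch below only builds the argument list; build_ptif_command and its force_srgb flag are hypothetical names, and the options themselves are exactly those of the command above.

```python
import shlex


def build_ptif_command(source, output, force_srgb=True):
    """Build the ImageMagick argument list for a tiled pyramid TIFF."""
    args = ["convert", source, "-alpha", "off"]
    if force_srgb:
        # Convert to sRGB so CMYK sources are not shown with inverted colors
        args += ["-colorspace", "sRGB"]
    args += [
        "-define", "tiff:tile-geometry=256x256",
        "-compress", "jpeg",
        f"ptif:{output}",
    ]
    return args


cmd = build_ptif_command("source_image.tif", "output_image.tif")
print(shlex.join(cmd))
```

On a machine with ImageMagick installed, the list could be passed to subprocess.run(cmd, check=True); passing a list avoids shell quoting problems with file names.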

Counting triples in an RDF store 2: co-occurrence frequency

Overview

I had occasion to count co-occurrence frequencies over RDF triples, so this is a memo for future reference. As in the previous article, the Japan Search RDF store is used as the example.

Example 1

The following query counts pairs of instances of type 刀剣 (swords) that share a common creator (schema:creator). The filter excludes pairing an instance with itself and prevents the same pair from being counted twice.

    select (count(*) as ?count) where {
      ?entity1 a type:刀剣;
               schema:creator ?value .
      ?entity2 a type:刀剣;
               schema:creator ?value .
      FILTER(?entity1 != ?entity2 && ?entity1 < ?entity2)
    }

https://jpsearch.go.jp/rdf/sparql/easy/?query=select+(count(*)+as+%3Fcount)+where+{ ++%3Fentity1+a+type%3A刀剣%3B +++++++++++++schema%3Acreator+%3Fvalue+. ++%3Fentity2+a+type%3A刀剣%3B +++++++++++++schema%3Acreator+%3Fvalue+. ++FILTER(%3Fentity1+!%3D+%3Fentity2+%26%26+%3Fentity1+<+%3Fentity2) }

Example 2

This query lists the concrete pairs.

    select ?entity1 ?entity2 ?value where {
      ?entity1 a type:刀剣;
               schema:creator ?value .
      ?entity2 a type:刀剣;
               schema:creator ?value .
      FILTER(?entity1 != ?entity2 && ?entity1 < ?entity2)
    }

https://jpsearch.go.jp/rdf/sparql/easy/?query=select+%3Fentity1+%3Fentity2+%3Fvalue+where+{ ++%3Fentity1+a+type%3A刀剣%3B +++++++++++++schema%3Acreator+%3Fvalue+. ++%3Fentity2+a+type%3A刀剣%3B +++++++++++++schema%3Acreator+%3Fvalue+. ++FILTER(%3Fentity1+!%3D+%3Fentity2+%26%26+%3Fentity1+<+%3Fentity2) }

...
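What the FILTER achieves can be reproduced on a toy dataset in memory: group entities by creator, then take each unordered pair once. The sample data and the function name cooccurring_pairs are made up for illustration and are not taken from Japan Search.

```python
from collections import defaultdict
from itertools import combinations

# (entity, creator) pairs standing in for "?entity schema:creator ?value"
creator_of = [
    ("sword1", "smithA"),
    ("sword2", "smithA"),
    ("sword3", "smithA"),
    ("sword4", "smithB"),
    ("sword5", "smithB"),
    ("sword6", "smithC"),
]


def cooccurring_pairs(pairs):
    """Return (entity1, entity2, creator) for every unordered pair sharing a creator."""
    by_creator = defaultdict(set)
    for entity, creator in pairs:
        by_creator[creator].add(entity)
    result = []
    for creator, entities in by_creator.items():
        # combinations over a sorted list yields each pair once with e1 < e2,
        # which mirrors FILTER(?entity1 < ?entity2)
        for e1, e2 in combinations(sorted(entities), 2):
            result.append((e1, e2, creator))
    return result


pairs = cooccurring_pairs(creator_of)
print(len(pairs))
```

A creator with n works contributes n(n-1)/2 pairs, so the count grows quadratically with the size of each group. The condition ?entity1 < ?entity2 alone already excludes identical entities, so the != test in the query is redundant but harmless.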

Counting triples in an RDF store

Overview

This is a memo on how to count the triples in an RDF store, using the Japan Search RDF store as the example.

https://jpsearch.go.jp/rdf/sparql/easy/

Number of triples

The following query counts the triples.

    SELECT (COUNT(*) AS ?NumberOfTriples) WHERE {
      ?s ?p ?o .
    }

The result is available here.

https://jpsearch.go.jp/rdf/sparql/easy/?query=SELECT+(COUNT(*)+AS+%3FNumberOfTriples) WHERE+{ ++%3Fs+%3Fp+%3Fo+. }

At the time of writing (May 6, 2024), there were 1,280,645,565 triples.

    NumberOfTriples
    1280645565

Number of triples per property

Next, count how many triples use each property. The query is as follows.

    SELECT ?p (COUNT(*) AS ?count) WHERE {
      ?s ?p ?o .
    }
    GROUP BY ?p
    ORDER BY DESC(?count)

The result is available here.

https://jpsearch.go.jp/rdf/sparql/easy/?query=SELECT+%3Fp+(COUNT(*)+AS+%3Fcount) WHERE+{ ++%3Fs+%3Fp+%3Fo+. } GROUP+BY+%3Fp ORDER+BY+DESC(%3Fcount)

It shows that schema:description connects 399,447,925 triples, roughly 400 million.

    p                    count
    schema:description   399447925
    rdf:type             84363276
    jps:relationType     72908233
    jps:value            72214780
    schema:name          57377225
    schema:provider      52481873

Counting combinations of subject type and object type for a given property

To get an overview of how a property is used, the following query counts the combinations of subject type and object type for triples in which ?subject and ?object are connected by schema:description.

    SELECT ?subjectType ?objectType (COUNT(*) AS ?count) WHERE {
      ?subject schema:description ?object .
      ?subject rdf:type ?subjectType .
      optional { ?object rdf:type ?objectType . }
    }
    GROUP BY ?subjectType ?objectType
    ORDER BY DESC(?count)

The result is as follows. ...
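The result links above are simply the query placed in the query parameter of the page URL. A minimal sketch of building such a link with the standard library follows; easy_sparql_url is a hypothetical helper, and the sketch only constructs the URL without sending any request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Page URL taken from the article; the helper below is illustrative only
EASY_SPARQL = "https://jpsearch.go.jp/rdf/sparql/easy/"

QUERY = """SELECT ?p (COUNT(*) AS ?count) WHERE {
  ?s ?p ?o .
}
GROUP BY ?p
ORDER BY DESC(?count)"""


def easy_sparql_url(query):
    """Return a shareable URL that carries the query in the "query" parameter."""
    return f"{EASY_SPARQL}?{urlencode({'query': query})}"


url = easy_sparql_url(QUERY)
print(url)

# Decoding the URL gives back the original query text
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

urlencode percent-encodes characters such as ?, { and line breaks, so the link survives being pasted into documents or chat tools.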

Trying TEIGarage

Overview

TEIGarage is described as follows.

https://github.com/TEIC/TEIGarage/

    TEIGarage is a webservice and RESTful service to transform, convert and validate various formats, focussing on the TEI format. TEIGarage is based on the proven OxGarage.

Trying it

It can be tried on the following page.

https://teigarage.tei-c.org/

The target is the "TEI Minimal" ODD file published at the URL below. This file is also used as one of the presets in Roma.

https://tei-c.org/Vault/P5/current/xml/tei/Exemplars/tei_minimal.odd

Download the file. Then, on the TEIGarage site, select "Compiled TEI ODD" for "Convert from" and "xHTML" for "Convert to", and upload the downloaded ODD file with the file chooser. The HTML file that is downloaded can be checked in a browser.

Clicking "Show advanced options" displays the parameters as well as the URL used for the conversion. The URL is encoded; decoded, it is as follows.

https://teigarage.tei-c.org/ege-webservice/Conversions/ODDC:text:xml/TEI:text:xml/xhtml:application:xhtml+xml/conversion?properties=truetrueenfalsedefaulttruetrueenfalsedefault

The properties parameter contains the following XML.

    <conversions>
      <conversion index="0">
        <property id="oxgarage.getImages">true</property>
        <property id="oxgarage.getOnlineImages">true</property>
        <property id="oxgarage.lang">en</property>
        <property id="oxgarage.textOnly">false</property>
        <property id="pl.psnc.dl.ege.tei.profileNames">default</property>
      </conversion>
      <conversion index="1">
        <property id="oxgarage.getImages">true</property>
        <property id="oxgarage.getOnlineImages">true</property>
        <property id="oxgarage.lang">en</property>
        <property id="oxgarage.textOnly">false</property>
        <property id="pl.psnc.dl.ege.tei.profileNames">default</property>
      </conversion>
    </conversions>

Open API

The following page lists the available options and other details based on Open API. ...

Handling the error: Input value "page" contains a non-scalar value.

Overview

An earlier article dealt with the same error. However, there were cases in which that fix did not resolve it, so this article describes an additional fix.

Error details

The error is shown below. It occurred in particular when jsonapi_search_api_facets was enabled.

    { "jsonapi": { "version": "1.0", "meta": { "links": { "self": { "href": "http://jsonapi.org/format/1.0/" } } } },
      "errors": [ {
        "title": "Bad Request",
        "status": "400",
        "detail": "Input value \"page\" contains a non-scalar value.",
        "links": {
          "via": { "href": "http://localhost:61117/web/jsonapi/index/document?page%5Blimit%5D=24&sort=field_id" },
          "info": { "href": "http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.1" }
        },
        "source": { "file": "/app/vendor/symfony/http-kernel/HttpKernel.php", "line": 83 },
        "meta": { "exception": "Symfony\\Component\\HttpFoundation\\Exception\\BadRequestException: Input value \"page\" contains a non-scalar value. in /app/vendor/symfony/http-foundation/InputBag.php:38\nStack trace:\n#0 /app/web/modules/contrib/facets/src/Plugin/facets/url_processor/QueryString.php(92): Symfony\\Component\\HttpFoundation\\InputBag->get('page')\n#1 /app/web/modules/contrib/facets/src/Plugin/facets/processor/UrlProcessorHandler.php(76): Drupal\\facets\\Plugin\\facets\\url_processor\\QueryString->buildUrls(Object(Drupal\\facets\\Entity\\Facet), Array)\n#2 /app/web/modules/contrib/facets/src/FacetManager/DefaultFacetManager.php(339): ...
Fix

I therefore modified buildUrls in the file named in the stack trace above (QueryString.php).

    <?php

    namespace Drupal\facets\Plugin\facets\url_processor;

    use Drupal\Core\Cache\UnchangingCacheableDependencyTrait;
    use Drupal\Core\Entity\EntityTypeManagerInterface;
    use Drupal\Core\EventSubscriber\MainContentViewSubscriber;
    use Drupal\facets\Event\ActiveFiltersParsed;
    use Drupal\facets\Event\QueryStringCreated;
    use Drupal\facets\Event\UrlCreated;
    use Drupal\facets\FacetInterface;
    use Drupal\facets\UrlProcessor\UrlProcessorPluginBase;
    use Drupal\facets\Utility\FacetsUrlGenerator;
    use Symfony\Component\DependencyInjection\ContainerInterface;
    use Symfony\Component\EventDispatcher\EventDispatcherInterface;
    use Symfony\Component\HttpFoundation\Request;
    use Drupal\jsonapi\Query\OffsetPage; // added

    /**
     * Query string URL processor.
     *
     * @FacetsUrlProcessor(
     *   id = "query_string",
     *   label = @Translation("Query string"),
     *   description = @Translation("Query string is the default Facets URL processor, and uses GET parameters, for example ?f[0]=brand:drupal&f[1]=color:blue")
     * )
     */
    class QueryString extends UrlProcessorPluginBase {

      ...

      /**
       * {@inheritdoc}
       */
      public function buildUrls(FacetInterface $facet, array $results) {
        // No results are found for this facet, so don't try to create urls.
        if (empty($results)) {
          return [];
        }

        // First get the current list of get parameters.
        $get_params = $this->request->query;

        // When adding/removing a filter the number of pages may have changed,
        // possibly resulting in an invalid page parameter.
        /* // commented out
        if ($get_params->has('page')) {
          $current_page = $get_params->get('page');
          $get_params->remove('page');
        }
        */

        // added
        if ($get_params->has(OffsetPage::KEY_NAME)) {
          $page_params = $get_params->all(OffsetPage::KEY_NAME);
          unset($page_params[OffsetPage::OFFSET_KEY]);
          $get_params->set(OffsetPage::KEY_NAME, $page_params);
        }

For this change, I referred to the following file, added Drupal\jsonapi\Query\OffsetPage, and revised how page is handled. ...

Bulk-deleting S3 buckets with the AWS CLI

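To list S3 buckets with the AWS CLI and delete those that match a particular pattern, follow the steps below. This article explains how to delete buckets whose names start with the string wby.

Requirements

- The AWS CLI is installed.
- Appropriate AWS credentials and access permissions are configured.

Step 1: List the buckets

First, use the AWS CLI to list all S3 buckets.

    aws s3 ls

Step 2: Delete the buckets that match the condition

To delete the buckets starting with wby, use a shell script that filters the matching buckets and deletes them. The following script finds bucket names starting with wby and deletes each bucket.

Caution: this script deletes each bucket together with all objects in it. Confirm that your data is backed up before running it.

    aws s3 ls | awk '{print $3}' | grep '^wby' | while read bucket
    do
      echo "Deleting bucket $bucket..."
      aws s3 rb "s3://$bucket" --force
    done

The script does the following:

- aws s3 ls lists the buckets.
- awk '{print $3}' extracts only the bucket names.
- grep '^wby' keeps the bucket names that start with wby.
- The while read bucket loop deletes each bucket.

Notes

- Before deleting buckets, make sure the data you need is backed up.
- The --force option of aws s3 rb deletes a bucket even when it is not empty, along with all objects in it.
- Before running the actual delete, it is advisable to run the loop with only the echo statement first, to check which bucket names would be deleted.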
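The dry run recommended above can also be sketched in Python: filter the text output of aws s3 ls by prefix and print the commands that would run, without deleting anything. The sample listing and the function name buckets_with_prefix are made up for illustration.

```python
def buckets_with_prefix(ls_output, prefix):
    """Return bucket names from `aws s3 ls` output that start with prefix."""
    names = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Each line has the form: <date> <time> <bucket-name>
        if len(fields) >= 3 and fields[2].startswith(prefix):
            names.append(fields[2])
    return names


# Made-up listing in the same three-column layout as `aws s3 ls`
SAMPLE_LS = """2024-04-01 10:00:00 wby-images
2024-04-02 11:30:00 other-bucket
2024-04-03 09:15:00 wby-logs
"""

targets = buckets_with_prefix(SAMPLE_LS, "wby")
for name in targets:
    # Dry run: only print the command that would delete the bucket
    print(f"aws s3 rb s3://{name} --force")
```

In real use, the listing could come from subprocess.run(["aws", "s3", "ls"], capture_output=True, text=True).stdout, and the printed commands would be executed only after the list has been reviewed.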

An example analysis of the texts published in "SAT大蔵経DB 2018" (SAT Daizōkyō DB 2018)

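Overview

"SAT大蔵経DB 2018" is described as follows (translated from the Japanese original).

https://21dzk.l.u-tokyo.ac.jp/SAT2018/master30.php

    This site is the 2018 edition of the digital research environment provided by the SAT Daizōkyō Text Database Committee. Since April 2008, the committee has offered a full-text search service for the 85 text volumes of the Taishō Shinshū Daizōkyō and, by providing functions for linking with web services elsewhere, has improved convenience and explored the possibilities of a humanities research environment on the web. SAT2018, the 2018 edition, takes on new services such as machine learning techniques that have been spreading in recent years, integration with high-resolution images via IIIF, and the publication of modern Japanese translations that even high school students can understand, linked to the main text. In addition, the Chinese characters of the main text have been made compatible with Unicode 10.0, and most of the functions of the previously published SAT Taishōzō Image DB have been integrated. This release is, however, also about providing a framework that includes collaboration, and data will be added along this framework to improve convenience further. The web services provided by the committee rely on services and support from many parties. For the services newly included in SAT2018, we received support from the International Institute for Digital Humanities for machine learning and IIIF support, and support from the Japan Buddhist Federation and the cooperation of Buddhist scholars nationwide for the modern Japanese translations. We hope SAT2018 will be useful not only to Buddhist scholars but to everyone interested in Buddhist scriptures. We would be all the more pleased if the way technology is applied to cultural materials here becomes a model for humanities research.

This article attempts a simple analysis of the text data published by this database.

Target

The target is the text of "T0220 大般若波羅蜜多經".

Method

Obtaining the text data

Inspecting the network traffic showed that the text data can be retrieved with URLs like the following.

https://21dzk.l.u-tokyo.ac.jp/SAT2018/satdb2018pre.php?mode=detail&ob=1&mode2=2&useid=0220_,05,0001

In the 0220_,05,0001 part, changing 05 to 06 returned the data for volume 6. Changing the trailing 0001 to 0011 returned text that includes the lines before and after 0011. Based on this behavior, I ran the following program.

    import os
    import requests
    import time
    from bs4 import BeautifulSoup

    def fetch_soup(url):
        """Fetches and parses HTML content from the given URL."""
        time.sleep(1)  # Sleep for 1 second before making a request
        response = requests.get(url)
        return BeautifulSoup(response.text, "html.parser")

    def write_html(soup, filepath):
        """Writes the prettified HTML content to a file."""
        with open(filepath, "w", encoding="utf-8") as file:
            file.write(soup.prettify())

    def read_html(filepath):
        """Reads HTML content from a file and returns its parsed content."""
        with open(filepath, "r", encoding="utf-8") as file:
            return BeautifulSoup(file.read(), "html.parser")

    def process_volume(vol):
        """Processes each volume by iterating over pages until no new page is found."""
        os.makedirs("html", exist_ok=True)  # make sure the cache directory exists
        page_str = "0001"
        while True:
            url = f"https://21dzk.l.u-tokyo.ac.jp/SAT2018/satdb2018pre.php?mode=detail&ob=1&mode2=2&useid=0220_{vol}_{page_str}"
            useid = url.split("useid=")[1]
            opath = f"html/{useid}.html"
            # Reuse the cached file when it has already been downloaded
            if os.path.exists(opath):
                soup = read_html(opath)
            else:
                soup = fetch_soup(url)
                write_html(soup, opath)
            new_page_str = get_last_page_id(soup)
            # Stop when no line number is found or the page no longer advances
            if new_page_str is None or new_page_str == page_str:
                break
            page_str = new_page_str

    def get_last_page_id(soup):
        """Extracts the last page ID from the soup object."""
        spans = soup.find_all("span", class_="ln")
        if spans:
            last_id = spans[-1].text
            return last_id.split(".")[-1][0:4]
        return None

    def main():
        vols = ["05", "06", "07"]
        for vol in vols:
            process_volume(vol)

    if __name__ == "__main__":
        main()

The HTML files can be downloaded with the processing above. ...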

Parsing an XML string in Node.js

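Overview

To parse an XML string in Node.js and extract information from it, the xmldom library is a good option. It lets you handle XML in much the same way as manipulating the DOM in a browser. The following shows how to parse XML with xmldom and extract elements, focusing on the "PAGE" tag.

Install the xmldom library

First, install xmldom, which is needed to parse the XML string.

    npm install xmldom

Parse the XML with xmldom and extract the required elements

    const { DOMParser } = require('xmldom');

    const xmlString = "...";

    // Parse the XML string with DOMParser
    const parser = new DOMParser();
    const xmlDoc = parser.parseFromString(xmlString, 'text/xml');

    // Get all PAGE elements
    const pages = xmlDoc.getElementsByTagName('PAGE');

    // Log the number of PAGE elements found (example)
    console.log('Number of PAGE elements:', pages.length);

This example parses the XML string into a document, collects every "PAGE" element, and logs how many were found. To go further, loop over pages and read each element's attributes or content; the processing inside the loop can be customized to your requirements, for example to extract specific details from each page.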
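The per-element loop mentioned above follows the same DOM pattern in other languages. As a rough illustration, here is the equivalent in Python using the standard library's xml.dom.minidom, which exposes the same getElementsByTagName call; the sample XML, its attribute names, and the function name summarize_pages are assumptions, not taken from the article's data.

```python
from xml.dom import minidom

# Hypothetical sample; real documents will have different attributes
xml_string = (
    '<OCRDATASET>'
    '<PAGE IMAGENAME="001.jpg" WIDTH="100"><LINE>a</LINE></PAGE>'
    '<PAGE IMAGENAME="002.jpg" WIDTH="120"/>'
    '</OCRDATASET>'
)


def summarize_pages(xml_text):
    """Return the attributes and child element count of each PAGE element."""
    doc = minidom.parseString(xml_text)
    summaries = []
    for page in doc.getElementsByTagName("PAGE"):
        attrs = dict(page.attributes.items())
        children = sum(1 for n in page.childNodes if n.nodeType == n.ELEMENT_NODE)
        summaries.append({"attributes": attrs, "child_elements": children})
    return summaries


pages = summarize_pages(xml_string)
for summary in pages:
    print(summary)
```

In the Node.js version, the same loop would index into pages and read each element's attributes in the body.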

LlamaIndex+GPT4+gradio

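Overview

I had occasion to use LlamaIndex, GPT4, and gradio together, so this is a memo for future reference. Because the text used is small, the results are correspondingly modest, but I prototyped a Shibusawa Eiichi chatbot.

Background

I referred to the following article.

https://qiita.com/DeepTama/items/1a44ddf6325c2b2cd030

Based on it, I made changes so that the code runs with the libraries as of April 20, 2024. The notebook is published here.

https://github.com/nakamura196/000_tools/blob/main/LlamaIndex%2BGPT4%2Bgradio.ipynb

The following data is used.

TEIを用いた『渋沢栄一伝記資料』テキストデータの再構築と活用 (Restructuring and use of the text data of the Shibusawa Eiichi Denki Shiryō using TEI)

Summary

I hope this is helpful.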

Creating an inline marker tool with Editor.js

Overview

This is a memo on how to create an inline marker tool with Editor.js.

Reference

The following pages were helpful.

https://editorjs.io/creating-an-inline-tool/
https://note.com/eveningmoon_lab/n/n638b9541c47c

For writing it in TypeScript, the following was helpful.

https://github.com/codex-team/editor.js/issues/900

Implementation

The implementation uses Nuxt. Create the following marker.ts.

    import type { API } from "@editorjs/editorjs";

    class MarkerTool {
      button: null | HTMLButtonElement;
      state: boolean;
      api: API;
      tag: string;
      class: string;

      // Static getter that specifies the allowed HTML tags and attributes
      static get sanitize() {
        return {
          mark: {
            class: "cdx-marker",
          },
        };
      }

      // Declare that this tool behaves as an inline tool
      static get isInline() {
        return true;
      }

      constructor({ api }: { api: API }) {
        this.api = api;
        this.button = null;
        this.state = false;
        this.tag = "MARK";
        this.class = "cdx-marker";
      }

      // Create the button element and set its SVG icon
      render() {
        this.button = document.createElement("button");
        this.button.type = "button";
        this.button.innerHTML = '<svg width="20" height="18"><path d="M10.458 12.04l2.919 1.686-.781 1.417-.984-.03-.974 1.687H8.674l1.49-2.583-.508-.775.802-1.401zm.546-.952l3.624-6.327a1.597 1.597 0 0 1 2.182-.59 1.632 1.632 0 0 1 .615 2.201l-3.519 6.391-2.902-1.675zm-7.73 3.467h3.465a1.123 1.123 0 1 1 0 2.247H3.273a1.123 1.123 0 1 1 0-2.247z"/></svg>';
        this.button.classList.add(this.api.styles.inlineToolButton);
        return this.button;
      }

      // Wrap the selected text in a <mark> tag, or unwrap it if already marked
      surround(range: Range) {
        if (this.state) {
          this.unwrap(range);
          return;
        }
        this.wrap(range);
      }

      // Wrap the text in a <mark> tag
      wrap(range: Range) {
        const selectedText = range.extractContents();
        const mark = document.createElement(this.tag);
        mark.className = this.class; // add the class attribute
        mark.appendChild(selectedText);
        range.insertNode(mark);
        this.api.selection.expandToTag(mark);
      }

      // Remove the <mark> tag
      unwrap(range: Range) {
        const mark = this.api.selection.findParentTag(this.tag);
        const text = range.extractContents();
        mark?.remove();
        range.insertNode(text);
      }

      // Check the state of the tool
      checkState() {
        const mark = this.api.selection.findParentTag(this.tag, this.class);
        this.state = !!mark;
        if (this.state) {
          this.button?.classList.add("cdx-marker--active");
        } else {
          this.button?.classList.remove("cdx-marker--active");
        }
      }
    }

    export default MarkerTool;

The tool is then used as follows. ...

Changing the max-width of Editor.js

Overview

By default, Editor.js leaves large margins on the left and right. This article shows how to reduce them.

Method

The following issue was helpful.

https://github.com/codex-team/editor.js/issues/1328

Specifically, add the following styles.

    .ce-block__content,
    .ce-toolbar__content {
      max-width: calc(100% - 80px) !important;
    }
    .cdx-block {
      max-width: 100% !important;
    }

The complete source code is as follows.

    <script setup lang="ts">
    import EditorJS from "@editorjs/editorjs";
    import type { OutputData } from "@editorjs/editorjs";

    const blocks = ref<OutputData>({
      time: new Date().getTime(),
      blocks: [
        {
          type: "paragraph",
          data: {
            text: "大明副使蒋承奉すらく、欽差督察総制提督浙江等処軍務各衙門、近年以来、日本各島小民、仮るに買売を以て名と為し、しばしば中国辺境を犯し、居民を刼掠するを因となし、旨を奉じて、浙江等処承宣布政使司に議行し、本職に転行して、親しく貴国に詣り面議せしめん等の因あり。",
          },
        },
      ],
    });

    const editor = () => {
      new EditorJS({
        holder: "editorjs",
        data: blocks.value,
        onChange: async (api) => {
          blocks.value = await api.saver.save();
        },
      });
    };

    editor();
    </script>

    <template>
      <div style="background-color: aliceblue">
        <div id="editorjs"></div>
        <hljson :content="blocks" />
      </div>
    </template>

    <style>
    .ce-block__content,
    .ce-toolbar__content {
      max-width: calc(100% - 80px) !important;
    }
    .cdx-block {
      max-width: 100% !important;
    }
    pre {
      background-color: #f4f4f4;
      border: 1px solid #ccc;
      padding: 10px;
    }
    </style>

As a result, the left and right margins became smaller. ...