Tech | デジタルアーカイブシステムの技術ブログ

NextAuth.jsを使って、ORCID・The Open Science Framework・ GakuNin RDMの認証を行う

概要 NextAuth.jsを使って、ORCID・OSF（The Open Science Framework）・ GRDM（GakuNin RDM）の認証を行う方法です。デモアプリ ORCID https://orcid-app.vercel.app/ OSF https://osf-app.vercel.app/ GRDM https://rdm-app.vercel.app/ リポジトリ ORCID https://github.com/nakamura196/orcid_app 以下がオプションの記述例です。 https://github.com/nakamura196/orcid_app/blob/main/src/app/api/auth/[…nextauth]/authOptions.js export const authOptions = { providers: [ { id: "orcid", name: "ORCID", type: "oauth", clientId: process.env.ORCID_CLIENT_ID, clientSecret: process.env.ORCID_CLIENT_SECRET, authorization: { url: "https://orcid.org/oauth/authorize", params: { scope: "/authenticate", response_type: "code", redirect_uri: process.env.NEXTAUTH_URL + "/api/auth/callback/orcid", }, }, token: "https://orcid.org/oauth/token", userinfo: { url: "https://pub.orcid.org/v3.0/[ORCID]", async request({ tokens }) { const res = await fetch(`https://pub.orcid.org/v3.0/${tokens.orcid}`, { headers: { Authorization: `Bearer ${tokens.access_token}`, Accept: "application/json", }, }); return await res.json(); }, }, profile(profile) { return { id: profile["orcid-identifier"].path, // ORCID の ID を取得 name: profile.person?.name?.["given-names"]?.value + " " + profile.person?.name?.["family-name"]?.value, email: profile.person?.emails?.email?.[0]?.email, }; }, }, ], callbacks: { async session({ session, token }) { session.accessToken = token.accessToken; session.user.id = token.orcid; // ORCID ID をセッションに追加 return session; }, async jwt({ token, account }) { if (account) { token.accessToken = account.access_token; token.orcid = account.orcid; } return token; }, }, }; OSF https://github.com/nakamura196/osf-app ...

2024年11月15日 · 更新: 2024年11月15日 · 3 分 · Nakamura

YOLOv11xと日本古典籍くずし字データセットを用いた文字の検出モデルの構築

概要 YOLOv11xと日本古典籍くずし字データセットを用いた文字の検出モデルの構築を行う機会がありましたので、備忘録です。 http://codh.rois.ac.jp/char-shape/ 参考過去に、YOLOv5を用いて同様のことを行いました。以下のspacesで動作デモや学習済みモデルをご確認いただけます。 https://huggingface.co/spaces/nakamura196/yolov5-char 以下は、「国宝金沢文庫文書データベース」の公開画像に対する適用例です。 YOLOv11を用いることで、文字検出の精度向上を狙うことが目的です。データセットの作成「日本古典籍くずし字データセット」をダウンロードし、yoloで求められる形式に整形します。形式は以下などで確認することができます。 https://github.com/ultralytics/hub/tree/main/example_datasets/coco8 画像のサイズを1280x1280に設定以下のUltralytics HUBを使用しました。 https://hub.ultralytics.com/ 以下が学習結果です。テストデータに対して適用したところ、良い精度がでる画像データ（例：「『源氏物語』(東京大学総合図書館所蔵)」）もあれば、あまり良い精度がでない画像データ（例：「国宝金沢文庫文書データベース」）もありました。画像のサイズを640x640に設定エポック数が10の場合エポック数が10の場合は、エポック数が10の場合、学習が完全に収束していない可能性がありました。一方、エポック数が少ないにも関わらず、テストデータに対しては、1280x1280のものより良い結果を示すようでした。エポック数が100の場合 from ultralytics import YOLO # YOLOv8の分類モデルをロード model = YOLO('yolo11x.pt') # Nanoサイズの分類モデル # データセットとトレーニング設定 model.train( data='/home/mdxuser/yolo/chars_640_split/data.yaml', # データセットのパス epochs=100, # エポック数 # imgsz=224, # 入力画像サイズ batch=24 # バッチサイズ（オプション） ) バッチサイズが16（デフォルト）では、GPUメモリの使用率が低く、32に設定すると、OutOfMemoryErrorになってしまいました。 torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.49 GiB. GPU 0 has a total capacity of 39.39 GiB of which 4.46 GiB is free. Including non-PyTorch memory, this process has 34.92 GiB memory in use. Of the allocated memory 31.86 GiB is allocated by PyTorch, and 2.49 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) バッチサイズが24 ...

2024年11月6日 · 更新: 2024年11月6日 · 1 分 · Nakamura

mdx.jpを用いてYOLOv11のクラス分類（くずし字認識）の学習を試す

概要 mdx.jpを用いてYOLOv11のクラス分類（くずし字認識）の学習を行う機会がありましたので、備忘録です。データセット以下の「くずし字データセット」を対象にします。 http://codh.rois.ac.jp/char-shape/book/ データセットの作成 yoloの形式に合致するようにデータセットを整形します。まず、書名ごとに分かれているデータをフラットにマージします。 #| export class Classification: def create_dataset(self, input_file_path, output_dir): # "../data/*/characters/*/*.jpg" files = glob(input_file_path) # output_dir = "../data/dataset" for file in tqdm(files): cls = file.split("/")[-2] output_file = f"{output_dir}/{cls}/{file.split('/')[-1]}" if os.path.exists(output_file): continue # print(f"Copying {file} to {output_file}") os.makedirs(f"{output_dir}/{cls}", exist_ok=True) shutil.copy(file, output_file) 次に、以下のようなスクリプトにより、データセットを分割します。 def split(self, input_dir, output_dir, train_ratio = 0.7, val_ratio = 0.15): if os.path.exists(output_dir): shutil.rmtree(output_dir) # クラスディレクトリの取得 classes = [d for d in os.listdir(input_dir) if os.path.isdir(os.path.join(input_dir, d))] # データを分割して保存 for cls in tqdm(classes): class_dir = os.path.join(input_dir, cls) files = [os.path.join(class_dir, f) for f in os.listdir(class_dir) if os.path.isfile(os.path.join(class_dir, f))] # シャッフルして分割 random.shuffle(files) train_end = int(len(files) * train_ratio) val_end = int(len(files) * (train_ratio + val_ratio)) train_files = files[:train_end] val_files = files[train_end:val_end] test_files = files[val_end:] # 保存ディレクトリを作成 for split, split_files in zip(["train", "val", "test"], [train_files, val_files, test_files]): split_dir = os.path.join(output_dir, split, cls) os.makedirs(split_dir, exist_ok=True) # ファイルをコピー for file in split_files: shutil.copy(file, os.path.join(split_dir, os.path.basename(file))) print("データの分割が完了しました。") 結果、1,086,326画像のデータセットが作成されました。 ...

2024年11月6日 · 更新: 2024年11月6日 · 4 分 · Nakamura

mdx.jpの1GPUパックとOllamaを使ってローカルLLMを実行する

概要 mdx.jpの1GPUパックとOllamaを使ってローカルLLMを実行する機会がありましたので、備忘録です。 https://mdx.jp/mdx1/p/guide/charge 参考以下の記事を参考にしました。 https://highreso.jp/edgehub/machinelearning/ollamainference.html モデルのダウンロードここでは、llama3.1:70bを対象にします。ダウンロード完了後、以下のように選択可能となります。使用例以下の『渋沢栄一伝記資料』を使用します。 https://github.com/shibusawa-dlab/lab1 APIの利用以下に記載がありました。 https://docs.openwebui.com/api/ 以下でJWTトークンとは別に、APIキーを発行します。以下が実行例です。 import requests import json text = '''六月十四日日　晴　風ナクシテ暑気昨日ニ比シテ少ク加フルヲ覚フ朝来少シク風邪気ナルニヨリ晏起、八時洗面ヲナス、後、六孫王宮ノ神官又ハ同志社員安藤氏等来訪ス、十時大阪支店長野口、神戸杉田、名古屋清水及西京支店長中川、其他小林、片野、前原等ノ諸氏ヲ伴ヒ嵐山ニ抵リ、三軒屋ニテ午飧シ、船ヲ浮ヘテ大江川《（堰）》ヲ遡ル、船中囲碁ノ興アリ、嵐山ノ緑葉少シク繁茂ニ過ルモ両岸ニハ山花咲乱レテ頗ル風致アリ、午後四時過玉川楼ニ帰宿ス、今朝、尾崎司法大臣ノ秘書黒田氏来ル、又、林和太郎氏（桂氏ノ父）来話ス、午前十時ヨリ各支店主任ト共ニ嵐山ニ抵リ三軒屋ニテ午飧ス、後、大江川《（大堰川）》ニ船ヲ浮ヘ、午後四時帰宿ス、後、玉川楼ニ於テ晩飧会ヲ開ク、種々ノ余興アリ、夜十時散会ス、中井三郎兵衛氏モ来会ス''' APIKEY = "sk-xxxx" url = "http://localhost:8080/api/chat/completions" headers = { "Authorization": f"Bearer {APIKEY}", "Content-Type": "application/json" } thres = 300 data = { "model": "llama3.1:70b", "messages": [ { "role": "user", "content": f"次のテキストは渋沢栄一の日記の一部です。テキストを{thres}字程度に要約してください。要約文のみを改行せずに返してください。句読点を適宜使用してください。\n\n{text}" } ] } response = requests.post(url, headers=headers, data=json.dumps(data)) # レスポンスの表示 print(response.status_code) print(response.json()) 結果、以下が得られました。ただし、結果が返却されるまでに60s弱かかってしまいました。 ...

2024年11月4日 · 更新: 2024年11月4日 · 1 分 · Nakamura

Google Cloud Vision APIを用いて、単一ページから構成される透明テキスト付きPDFを作成する

概要 PDFを対象に、Google Cloud Vision APIを使って、透明テキスト付きPDFを作成する機会がありましたので、備忘録です。以下、simpleで検索した例です。背景今回は単一ページから構成されるPDFを対象とします。手順画像の作成 OCRの対象とする画像を作成します。デフォルトの設定だとボヤけた画像ができてしまったので、解像度を2倍に設定し、また後述するプロセスで、解像度を考慮した位置合わせを実施しています。以下をインストールします。 PyMuPDF Pillow import fitz # PyMuPDF from PIL import Image import json from tqdm import tqdm import io # 入力PDFファイルと出力PDFファイル input_pdf_path = "./input.pdf" # 単一ページのPDFファイル output_pdf_path = "./output.pdf" # 入力PDFファイルを開き、単一ページを読み込み pdf_document = fitz.open(input_pdf_path) page = pdf_document[0] # 最初のページを選択 # ページを画像としてレンダリングし、OCRでテキストを抽出 # pix = page.get_pixmap() # 解像度300 DPIでレンダリング zoom = 2.0 # 解像度を上げるためにズーム設定 mat = fitz.Matrix(zoom, zoom) pix = page.get_pixmap(matrix=mat) img = Image.open(io.BytesIO(pix.tobytes("png"))) img.save("./image.png") Google Cloud Vision API 出力された画像を対象に、Google Cloud Vision APIを適用します。 { "textAnnotations": [ { "boundingPoly": { "vertices": [ { "x": 141, "y": 152 }, { "x": 1082, "y": 152 }, { "x": 1082, "y": 1410 }, { "x": 141, "y": 1410 } ] }, "description": "Sample PDF...", "locale": "la" }, { "boundingPoly": { "vertices": [ { "x": 141, "y": 159 }, { "x": 363, "y": 156 }, { "x": 364, "y": 216 }, { "x": 142, "y": 219 } ] }, "description": "Sample" }, { "boundingPoly": { "vertices": [ { "x": 382, "y": 156 }, { "x": 506, "y": 154 }, { "x": 507, "y": 213 }, { "x": 383, "y": 215 } ] }, "description": "PDF" }, ... 出力結果として得られるJSONファイルを./google_ocr.jsonといった名前で保存します。 ...

2024年11月2日 · 更新: 2024年11月2日 · 2 分 · Nakamura

ZoteroのAPIをNext.jsから使う

概要 ZoteroのAPIをNext.jsから使う方法を調べましたので、備忘録です。結果、以下のアプリケーションを作成しました。 https://zotero-rouge.vercel.app/ ライブラリ以下のライブラリを使用しました。 https://github.com/tnajdek/zotero-api-client API Keyなどの取得以下の記事を参考にしてください。使い方コレクション一覧 // app/api/zotero/collections/route.js import { NextResponse } from "next/server"; import api from "zotero-api-client"; import { prisma } from "@/lib/prisma"; import { decrypt } from "../../posts/encryption"; import { getSession } from "@auth0/nextjs-auth0"; async function fetchZoteroCollections( zoteroApiKey: string, zoteroUserId: string ) { const myapi = api(zoteroApiKey).library("user", zoteroUserId); const collectionsResponse = await myapi.collections().get(); return collectionsResponse.raw; } 特定のコレクション // app/api/zotero/collection/[id]/route.ts import { NextResponse } from "next/server"; import api from "zotero-api-client"; import { prisma } from "@/lib/prisma"; import { decrypt } from "@/app/api/posts/encryption"; import { getSession } from "@auth0/nextjs-auth0"; async function fetchZoteroCollection( zoteroApiKey: string, zoteroUserId: string, collectionId: string ) { const myapi = api(zoteroApiKey).library("user", zoteroUserId); const collectionResponse = await myapi.collections(collectionId).get(); return collectionResponse.raw; } 特定のコレクション内のアイテム一覧 // app/api/zotero/collection/[id]/items/route.ts import { NextResponse, NextRequest } from "next/server"; import api from "zotero-api-client"; import { prisma } from "@/lib/prisma"; import { decrypt } from "@/app/api/posts/encryption"; import { getSession } from "@auth0/nextjs-auth0"; async function fetchZoteroCollection( zoteroApiKey: string, zoteroUserId: string, collectionId: string ) { const myapi = api(zoteroApiKey).library("user", zoteroUserId); const collectionResponse = await myapi .collections(collectionId) .items() .get(); return collectionResponse.raw; 参考アプリケーションはVercelにホスティングされており、データベースにはVercel Postgres、ORMにはPrismaを使用しました。UIはTailwind CSSで構築され、ChatGPTのデザイン提案を使用しました。また、認証にはAuth0を採用しています。まとめ ZoteroのAPI利用にあたり、参考になりましたら幸いです。 ...

2024年11月1日 · 更新: 2024年11月1日 · 1 分 · Nakamura

LEAF WriterのEditor Toolbarをカスタマイズする

概要 LEAF Writerでは、画面上部にタグの挿入をサポートするボタンが提供されています。本記事では、その編集方法について紹介します。結果、以下のように、<app><lem>あああ</lem><rdg>いいい</rdg></app>を挿入する機能を追加しました。 https://youtu.be/XMnRP7s2atw 編集以下のファイルを編集します。 packages/cwrc-leafwriter/src/components/editorToolbar/index.tsx 以下のように、人名や地名のタグをサポートする機能が設定されています。例えば、以下では、organizationに関する記述をコメントアウトしています。 ... const items: (MenuItem | Item)[] = [ { group: 'action', hide: isReadonly, icon: 'insertTag', onClick: () => { if (!container.current) return; const rect = container.current.getBoundingClientRect(); const posX = rect.left; const posY = rect.top + 34; showContextMenu({ // anchorEl: container.current, eventSource: 'ribbon', position: { posX, posY }, useSelection: true, }); }, title: 'Tag', tooltip: 'Add Tag', type: 'button', }, { group: 'action', type: 'divider', hide: isReadonly }, { color: entity.person.color.main, group: 'action', disabled: !isSupported('person'), hide: isReadonly, icon: entity.person.icon, onClick: () => window.writer.tagger.addEntityDialog('person'), title: 'Tag Person', type: 'iconButton', }, { color: entity.place.color.main, group: 'action', disabled: !isSupported('place'), hide: isReadonly, icon: entity.place.icon, onClick: () => window.writer.tagger.addEntityDialog('place'), title: 'Tag Place', type: 'iconButton', }, /* { color: entity.organization.color.main, group: 'action', disabled: !isSupported('organization'), hide: isReadonly, icon: entity.organization.icon, onClick: () => window.writer.tagger.addEntityDialog('organization'), title: 'Tag Organization', type: 'iconButton', }, ... 結果、以下のように選択肢が限定されます。 ...

2024年10月31日 · 更新: 2024年10月31日 · 4 分 · Nakamura

GakuNin RDM APIをつかってみる

概要 GakuNin RDMでは、以下でAPIが公開されています。このAPIの使用例について備忘録です。 https://api.rdm.nii.ac.jp/v2/ 参考 GakuNin RDMはOSF（Open Science Framework）をベースに構築されており、APIに関するドキュメントは以下で確認することができます。OpenAPIに準拠しています。 https://developer.osf.io/ PATの取得 PAT（パーソナルアクセストークン）を取得します。ログイン後、以下のURLから作成することができます。 https://rdm.nii.ac.jp/settings/tokens/ 利用以下のようなスクリプトにより、プログラムからもアクセスすることができます。 access_token=xxx import requests import os from dotenv import load_dotenv load_dotenv(verbose=True) load_dotenv("./env") access_token = os.environ.get("access_token") # アップロードURLにクエリパラメータを追加 url = f'https://api.rdm.nii.ac.jp/v2/nodes/' # ファイルを開き、PUTリクエストでアップロード # with open(file_path, 'rb') as file: response = requests.get( url, headers={ 'Authorization': f'Bearer {access_token}' } ) response.json() 参考までに、Authorizationヘッダーが未指定の場合、以下の結果が返却されました。 {'data': [], 'links': {'first': None, 'last': None, 'prev': None, 'next': None, 'meta': {'total': 0, 'per_page': 10}}, 'meta': {'version': '2.0'}} ブラウザで確認 APIの出力結果はブラウザでも確認することができます。 https://api.rdm.nii.ac.jp/v2/ 以下のように、Django REST frameworkで作成されていることがわかります。また、おそらく「Django REST framework JSON:API」を使用し、JSON:APIが採用されていることがわかります。まとめ GakuNin RDMおよびOSF（Open Science Framework）のAPIについて、参考になりましたら幸いです。 ...

2024年10月26日 · 更新: 2024年10月26日 · 1 分 · Nakamura

ShExCとShExJの違い

概要 ShExC（ShEx Compact Syntax）とShExJ（ShEx JSON Syntax）の違いについて、ChatGPTによる回答です。誤っている点もあるかもしれませんが、参考になりましたら幸いです。回答 ShExC（ShEx Compact Syntax）とShExJ（ShEx JSON Syntax）は、どちらもShEx（Shape Expressions）スキーマの表現形式ですが、表記形式や用途が異なります。以下にその違いを説明します。 1. 表記形式 ShExC（ShEx Compact Syntax） : テキストベースのシンタックスです。人間が読み書きしやすいように設計されており、シンプルかつ簡潔にRDFデータの構造や制約を記述することができます。主に手動で記述したり、コード内に直接埋め込むことを想定して設計されています。例 : <PersonShape> { foaf:name xsd:string ; foaf:age xsd:integer ? } * これは「Person」という形状が、`foaf:name` という必須の文字列プロパティと、`foaf:age` というオプションの整数プロパティを持つことを示しています。 ShExJ（ShEx JSON Syntax） : JSONベースのシンタックスです。機械可読性を重視した形式で、ShExスキーマをプログラムやAPIで扱うために設計されています。ツールやライブラリで処理しやすいため、特にプログラム間でのデータ交換やAPIレスポンスで使用されることが多いです。例 : { "type": "Schema", "shapes": [ { "id": "PersonShape", "type": "Shape", "expression": { "type": "EachOf", "expressions": [ { "type": "TripleConstraint", "predicate": "foaf:name", "valueExpr": { "type": "NodeConstraint", "datatype": "xsd:string" } }, { "type": "TripleConstraint", "predicate": "foaf:age", "valueExpr": { "type": "NodeConstraint", "datatype": "xsd:integer" }, "min": 0, "max": 1 } ] } } ] } 2. 用途 ShExC : ...

2024年10月25日 · 更新: 2024年10月25日 · 1 分 · Nakamura

ShExとSHACLの違い

概要 ShEx（Shape Expressions）Schema と SHACL（Shapes Constraint Language）の違いに関するChatGPTによる回答です。間違っている点もあるかもしれませんが、参考になりましたら幸いです。回答 ShEx（Shape Expressions）Schema と SHACL（Shapes Constraint Language）は、どちらもRDFデータに対する検証や制約を定義するための言語です。両者は同じ目的を持ちながらも、構文やアプローチに違いがあります。以下にその違いを説明します。 1. 目的と用途 ShEx（Shape Expressions） : ShExは、RDFデータが特定の形状（Shape）に従っているかどうかを簡潔に定義し、検証するための言語です。主に、柔軟で簡単なスキーマ定義を目的として設計されており、特定の形状にデータが適合するかどうかの判定がしやすいです。スキーマ検証が主な用途で、特定のシェイプにデータが従っているかどうかをすばやく確認できます。 SHACL（Shapes Constraint Language） : SHACLは、RDFデータに対して複雑な制約やルールを定義するために設計されたW3C標準の言語です。より柔軟で強力な検証機能を持ち、スキーマ定義だけでなく、制約の論理的な定義や高次の推論を行うことができます。例えば、条件付き制約や、プロパティの型、データの依存関係など、より複雑なルールを記述できます。 2. 構文と表現力 ShEx : シンプルでコンパクトな構文を持ち、RDFデータの形状を定義する際に、簡単に読み書きできるよう設計されています。直感的で短い記述でシェイプを定義できるため、学習コストが低く、データ検証を行う際に適しています。例（ShExC形式）: <PersonShape> { foaf:name xsd:string ; foaf:age xsd:integer ? } * これは、「`PersonShape`は、`foaf:name`として文字列型、`foaf:age`として整数型のオプションプロパティを持つ」という意味です。 SHACL : SHACLはRDFグラフ上で定義され、検証ロジックをRDFのトリプル形式で記述します。 SPARQLクエリを使って検証を行うことができるため、より複雑なルールや推論が可能です。例（Turtle形式）: ex:PersonShape a sh:NodeShape ; sh:targetClass foaf:Person ; sh:property [ sh:path foaf:name ; sh:datatype xsd:string ; ] ; sh:property [ sh:path foaf:age ; sh:datatype xsd:integer ; sh:maxCount 1 ; ] . * これは「`PersonShape`は、`foaf:name`として文字列型、`foaf:age`として整数型で1つまでのプロパティを持つ」という意味です。 3. 検証方法 ShEx : ...

2024年10月25日 · 更新: 2024年10月25日 · 1 分 · Nakamura

@samvera/rampビューアのFiles/Markersタブの使用方法

概要 IIIF Audio/Visualに対応したビューアの一つである@samvera/rampビューアのFiles/Markersタブの使用方法について調べたので備忘録です。マニュアル Filesについては、以下に記載がありました。 https://samvera-labs.github.io/ramp/#supplementalfiles また、Markersについては、以下に記載があります。 https://samvera-labs.github.io/ramp/#markersdisplay 使用するデータ『県政ニュース第1巻』（県立長野図書館）を使用します。 https://www.ro-da.jp/shinshu-dcommons/library/02FT0102974177 Filesタブ renderingプロパティを読むと記載されています。renderingプロパティについては、以下のCookbookにも掲載されています。 https://iiif.io/api/cookbook/recipe/0046-rendering/ 以下のようなスクリプトにより、マニフェストファイルにrenderingプロパティを追加します。 def add_rendering(self, manifest_path): manifest = self.load_manifest(manifest_path) japan_search_id = manifest.homepage[1].id.split("/")[-1] japan_search_api_url = f"https://jpsearch.go.jp/api/item/{japan_search_id}" rendering = ResourceItem( label={ "ja": ["アイテム参照API"], }, id=japan_search_api_url, type="Dataset", format="application/json" ) manifest.rendering = rendering output_path = f"{self.input_dir}/manifest_rendering.json" with open(output_path, "w") as f: f.write(manifest.json(indent=2)) return output_path 以下のようなマニフェストファイルが作成されます。 { "@context": "http://iiif.io/api/presentation/3/context.json", "id": "https://d1u7hq8ziluwl9.cloudfront.net/sdcommons_npl-02FT0102974177/manifest.json", "type": "Manifest", "label": { "ja": [ "県政ニュース　第1巻" ] }, "requiredStatement": { "label": { "ja": [ "Attribution" ] }, "value": { "ja": [ "『県政ニュース　第1巻』（県立長野図書館）を改変" ] } }, "rendering": [ { "id": "https://jpsearch.go.jp/api/item/sdcommons_npl-02FT0102974177", "type": "Dataset", "label": { "ja": [ "アイテム参照API" ] }, "format": "application/json" } ], ... } ビューアでの表示例は以下です。 ...

2024年10月17日 · 更新: 2024年10月17日 · 3 分 · Nakamura

Omeka SのOAI-PMHリポジトリのresumptionTokenの不具合への対応

概要 Omeka SのOAI-PMHリポジトリのresumptionTokenにおいて、有効期限内にもかかわらず、以下のように、[badResumptionToken]が出力される事例に遭遇しました。この不具合への対処方法についてメモします。対応以下のファイルについて、$currentTimeと$expirationTimeを比較する処理を加えたところ、有効期限内のtokenが残るようになりました。 ... private function resumeListResponse($token): void { $api = $this->serviceLocator->get('ControllerPluginManager')->get('api'); $expiredTokens = $api->search('oaipmh_repository_tokens', [ 'expired' => true, ])->getContent(); foreach ($expiredTokens as $expiredToken) { $currentTime = new \DateTime(); // 追加 $expirationTime = $expiredToken->expiration();　// 追加 if (!$expiredToken || $currentTime > $expirationTime) {　// 追加 $api->delete('oaipmh_repository_tokens', $expiredToken->id()); }　// 追加 } 上記のような対応をしなくてもうまくいくケースがあったので、PHPのバージョン等による違いがあるのかもしれません。参考参考までに、OAI-PMHリポジトリモジュールによって作成されるテーブルの中身を確認しました。 ...

2024年10月10日 · 更新: 2024年10月10日 · 2 分 · Nakamura

（非標準）Omeka SのOAI-PMH RepositoryモジュールでDeleteレコードを出力してみる

概要 Omeka SのOAI-PMH RepositoryモジュールでDeleteレコードを出力してみましたので、備忘録です。背景以下のモジュールを使用することにより、OAI-PMHのリポジトリ機能を構築することができます。 https://omeka.org/s/modules/OaiPmhRepository/ ただ、確認した限り、Deleteレコードを出力する機能はないようでした。関連モジュール Omekaの標準機能では、削除されたリソースを保存する機能はないかと思います。一方、以下のモジュールは削除されたリソースを保持する機能を追加します。 https://github.com/biblibre/omeka-s-module-Necropolis 本モジュールを有効化することにより、以下のように、リソースがいつ誰によって削除されたかを記録できるようになりました。 OAI-PMH Repositoryモジュールへの応用上記のモジュールで作成される削除されたリソースの情報が格納されるテーブルを使って、Deleteレコードの出力を試みます。以下のファイルのlistResponse関数に追記します。 private function listResponse($verb, $metadataPrefix, $cursor, $set, $from, $until): void { /** * @var \Omeka\Api\Adapter\Manager $apiAdapterManager * @var \Doctrine\ORM\EntityManager $entityManager */ $apiAdapterManager = $this->serviceLocator->get('Omeka\ApiAdapterManager'); $entityManager = $this->serviceLocator->get('Omeka\EntityManager'); $itemRepository = $entityManager->getRepository(\Omeka\Entity\Item::class); $qb = $itemRepository->createQueryBuilder('omeka_root'); $qb->select('omeka_root'); $query = new ArrayObject; $expr = $qb->expr(); // 以下を追加 if ($set === 'o:deleted') { $settings = $this->serviceLocator->get('Omeka\Settings'); $namespaceId = $settings->get('oaipmhrepository_namespace_id', 'default_namespace'); // 削除済みレコードを necropolis_resource テーブルから取得する $deletedResourceRepository = $entityManager->getRepository(\Necropolis\Entity\NecropolisResource::class); // カスタムエンティティ // oaipmhrepository_expose_mediaに応じて、mediaとitemの取得を分ける $exposeMedia = $settings->get('oaipmhrepository_expose_media', false); // デフォルトはfalse（itemのみ） if ($exposeMedia) { $qb = $deletedResourceRepository->createQueryBuilder('necropolis_resource'); } else { // Itemのみを取得する $qb = $deletedResourceRepository->createQueryBuilder('necropolis_resource') ->andWhere('necropolis_resource.resourceType = :itemType') ->setParameter('itemType', 'Omeka\Entity\Item'); } // 日付フィルタリング if ($from) { $qb->andWhere($expr->gte('necropolis_resource.deleted', ':from')); $qb->setParameter('from', $from); } if ($until) { $qb->andWhere($expr->lte('necropolis_resource.deleted', ':until')); $qb->setParameter('until', $until); } // 結果の制限とオフセット $qb->setMaxResults($this->_listLimit); $qb->setFirstResult($cursor); $paginator = new Paginator($qb, false); $rows = count($paginator); if ($rows == 0) { $this->throwError(self::OAI_ERR_NO_RECORDS_MATCH, new Message('No records match the given criteria.')); // @translate } else { if ($verb == 'ListIdentifiers') { $method = 'appendHeader'; } elseif ($verb == 'ListRecords') { $method = 'appendRecord'; } $verbElement = $this->document->createElement($verb); $this->document->documentElement->appendChild($verbElement); foreach ($paginator as $deletedEntity) { // 削除されたリソースの情報をOAI-PMHレスポンスに追加 $header = $this->document->createElement('header'); $header->setAttribute('status', 'deleted'); // 削除済みレコードとして設定 $identifier = $this->document->createElement('identifier', 'oai:' . $namespaceId . ":" . $deletedEntity->getId()); $header->appendChild($identifier); $datestamp = $this->document->createElement('datestamp', $deletedEntity->getDeleted()->format('Y-m-d\TH:i:s\Z')); $header->appendChild($datestamp); $verbElement->appendChild($header); } // Resumption Token の処理 if ($rows > ($cursor + $this->_listLimit)) { $token = $this->createResumptionToken($verb, $metadataPrefix, $cursor + $this->_listLimit, $set, $from, $until); $tokenElement = $this->document->createElement('resumptionToken', (string) $token->id()); $tokenElement->setAttribute('expirationDate', $token->expiration()->format('Y-m-d\TH:i:s\Z')); $tokenElement->setAttribute('completeListSize', (string) $rows); $tokenElement->setAttribute('cursor', (string) $cursor); $verbElement->appendChild($tokenElement); } elseif ($cursor != 0) { $tokenElement = $this->document->createElement('resumptionToken'); $verbElement->appendChild($tokenElement); } } return; } ... OAI-PMH標準には合致していない実装方法ですが、setにo:deletedを指定すると、削除レコードを返却することができます。 ...

2024年10月10日 · 更新: 2024年10月10日 · 2 分 · Nakamura

iiif-prezi3を使って、動画に目次を付与する

概要 iiif-prezi3を使って、動画に目次を付与する方法に関する備忘録です。セグメントの検出 Amazon Rekognitionのビデオセグメントの検出を用います。 https://docs.aws.amazon.com/ja_jp/rekognition/latest/dg/segments.html 以下などでサンプルコードが公開されています。 https://docs.aws.amazon.com/ja_jp/rekognition/latest/dg/segment-example.html 使用するデータ『県政ニュース第1巻』（県立長野図書館）を使用します。 https://www.ro-da.jp/shinshu-dcommons/library/02FT0102974177 マニフェストファイルへの反映以下の記事などを参考に、マニフェストファイルが作成済みであるとします。以下のようなスクリプトにより、マニフェストファイルにvttファイルを追加します。 from iiif_prezi3 import Manifest, AnnotationPage, Annotation, ResourceItem, config, HomepageItem, KeyValueString #| export class IiifClient: def load_manifest(self, manifest_path): with open(manifest_path, "r") as f: manifest_json = json.load(f) manifest = Manifest(**manifest_json) return manifest def add_segment(self, manifest_path): manifest = self.load_manifest(manifest_path) path = f"{self.input_dir}/output_segment.json" with open(path, "r") as f: data = json.load(f) range_id = f"{self.prefix}/range" range_toc = manifest.make_range( id=f"{range_id}/r", label="Table of Contents" ) canvas_id = manifest.items[0].id for s in data["Segments"]: s_type = s["Type"] if s_type != "SHOT": continue index = s["Shot Index"] start = s["Start Timestamp (milliseconds)"] / 1000 end = s["End Timestamp (milliseconds)"] / 1000 range_seg = range_toc.make_range( id=f"{range_id}/r{index}", label=f"Segment {index}", ) range_seg.items.append({ "id": f"{canvas_id}#t={start},{end}", "type": "Canvas" }) output_path = f"{self.input_dir}/manifest_segment.json" with open(output_path, "w") as f: f.write(manifest.json(indent=2)) return output_path 以下のようなマニフェストファイルが作成されます。 ...

2024年10月9日 · 更新: 2024年10月9日 · 2 分 · Nakamura

iiif-prezi3を使って、動画に字幕を設定する

概要 iiif-prezi3を使って、動画に字幕を設定する方法に関する備忘録です。字幕の作成 OpenAIのAPIを使用して字幕ファイルを作成しました。動画のファイルを音声ファイルに変換しています。 from openai import OpenAI from pydub import AudioSegment from dotenv import load_dotenv class VideoClient: def __init__(self): load_dotenv(verbose=True) api_key = os.getenv("OPENAI_API_KEY") self.client = OpenAI(api_key=api_key) def get_transcriptions(self, input_movie_path): audio = AudioSegment.from_file(input_movie_path) # 一時ファイルにオーディオを書き込む with tempfile.NamedTemporaryFile(suffix=".mp3") as temp_audio_file: audio.export(temp_audio_file.name, format="mp3") # MP3形式でエクスポート temp_audio_file.seek(0) # ファイルポインタを先頭に戻す # Whisper APIでトランスクリプトを取得 with open(temp_audio_file.name, "rb") as audio_file: # Whisper APIでトランスクリプトを取得 transcript = self.client.audio.transcriptions.create( model="whisper-1", file=audio_file, response_format="vtt" ) return transcript 使用するデータ『県政ニュース第1巻』（県立長野図書館）を使用します。 https://www.ro-da.jp/shinshu-dcommons/library/02FT0102974177 マニフェストファイルへの反映以下の記事などを参考に、マニフェストファイルが作成済みであるとします。以下のようなスクリプトにより、マニフェストファイルにvttファイルを追加します。 from iiif_prezi3 import Manifest, AnnotationPage, Annotation, ResourceItem, config, HomepageItem, KeyValueString #| export class IiifClient: def load_manifest(self, manifest_path): with open(manifest_path, "r") as f: manifest_json = json.load(f) manifest = Manifest(**manifest_json) return manifest def add_vtt(self, manifest_simple_path): manifest = self.load_manifest(manifest_simple_path) vtt_url = f"{self.prefix}/video.vtt" canvas = manifest.items[0] vtt_anno_page = AnnotationPage(id=f"{canvas.id}/page2") canvas.annotations = [ vtt_anno_page, ] vtt_body = ResourceItem(id=vtt_url, type="Text", format="text/vtt") vtt_anno = Annotation( id=f"{vtt_anno_page.id}/a1", motivation="supplementing", body=vtt_body, target=canvas.id, label = "WebVTT Transcript (machine-generated)" ) vtt_anno_page.add_item(vtt_anno) with open(f"{self.input_dir}/manifest_vtt.json", "w") as f: f.write(manifest.json(indent=2)) 以下のようなマニフェストファイルが作成されます。 ...

2024年10月9日 · 更新: 2024年10月9日 · 2 分 · Nakamura

iiif-prezi3を使って、動画にアノテーションを付与する

概要 iiif-prezi3を使って、動画にアノテーションを付与する方法に関する備忘録です。アノテーションの付与 Amazon Rekognitionのlabel detectionを用います。 https://docs.aws.amazon.com/rekognition/latest/dg/labels.html?pg=ln&sec=ft 以下などでサンプルコードが公開されています。 https://docs.aws.amazon.com/ja_jp/rekognition/latest/dg/labels-detecting-labels-video.html 特に、GetLabelDetectionにおける集計をSEGMENTSにすることで、StartTimestampMillisとEndTimestampMillisを得ることができます。ただし、以下の点に注意が必要です。 SEGMENTS による集計の場合、境界ボックス付きの検出されたインスタンスに関する情報は返されません。使用するデータ『県政ニュース第1巻』（県立長野図書館）を使用します。 https://www.ro-da.jp/shinshu-dcommons/library/02FT0102974177 マニフェストファイルへの反映以下の記事などを参考に、マニフェストファイルが作成済みであるとします。以下のようなスクリプトにより、マニフェストファイルにvttファイルを追加します。 from iiif_prezi3 import Manifest, AnnotationPage, Annotation, ResourceItem, config, HomepageItem, KeyValueString #| export class IiifClient: def load_manifest(self, manifest_path): with open(manifest_path, "r") as f: manifest_json = json.load(f) manifest = Manifest(**manifest_json) return manifest def add_label_segment(self, manifest_path): manifest = self.load_manifest(manifest_path) label_path = f"{self.input_dir}/output_label_seg.json" with open(label_path, "r") as f: label_seg = json.load(f) canvas = manifest.items[0] labels = label_seg["Labels"] anno_page_id = f"{canvas.id}/page1" anno_page = AnnotationPage(id=anno_page_id) canvas.annotations.append(anno_page) for i in range(len(labels)): label = labels[i] start = label["StartTimestamp"] / 1000 end = label["EndTimestamp"] / 1000 name = label["Label"]["Name"] anno_id = f"{anno_page_id}/a{i}" anno = Annotation( id=anno_id, motivation="tagging", target=canvas.id + "#t=" + str(start) + "," + str(end), body={ "type": "TextualBody", "value": name, "format": "text/plain", } ) anno_page.add_item( anno ) output_path = f"{self.input_dir}/manifest_label_seg.json" with open(output_path, "w") as f: f.write(manifest.json(indent=2)) return output_path 以下のようなマニフェストファイルが作成されます。 ...

2024年10月9日 · 更新: 2024年10月9日 · 2 分 · Nakamura

Next.jsでアンダースコアから始まるURLセグメントを利用する

概要 </api/_search> のようなAPIを作成するにあたり、アンダースコアから始まるURLセグメントを作成する方法を調べましたので、備忘録です。方法以下に記載がありました。 https://nextjs.org/docs/app/building-your-application/routing/colocation#:~:text=js file conventions.-,Good to know,-While not a 以下、日本語訳です。 URLセグメントでアンダースコアから始まる部分を作成するには、フォルダー名に %5F（アンダースコアのURLエンコード形式）を付けてください。例: %5FfolderName。 /api/%5Fsearch/route.ts のようにファイルを作成することで、解決することができました。まとめ参考になりましたら幸いです。

2024年10月8日 · 更新: 2024年10月8日 · 1 分 · Nakamura

@elastic/search-uiのsetFilterの不具合対応

概要 @elastic/search-uiのsetFilterについて、不具合が報告されています。 https://github.com/elastic/search-ui/issues/1057 この不具合について、すでに以下のコミットで修正されています。 https://github.com/elastic/search-ui/pull/1058 ただ、2024年10月7日時点において、上記の対応がなされた最新版がリリースされていません。そこで、独自にビルドしてリリースすることを試みましたので、その手順に関する備忘録です。修正まず、リポジトリをフェッチしました。 https://github.com/nakamura196/search-ui そして、以下の修正を加えました。 https://github.com/nakamura196/search-ui/commit/f7c7dc332086ca77a2c488f3de8780bbeb683324 具体的には、package.jsonと.npmrcに変更を加えました。パッケージの公開以下の手順により、パッケージを公開します。 1. GitHub パーソナルアクセストークンを生成 GitHub Package Registry にアクセスするためには、GitHub のパーソナルアクセストークンが必要です。以下の手順でトークンを作成します: GitHubのアカウントにログインします。右上のプロフィール画像をクリックし、「Settings」を選択。左側のメニューから「Developer settings」を選択。「Personal access tokens (classic)」を選択し、「Generate new token」をクリック。必要な権限（write:packages と read:packages など）を選択して、トークンを生成します。 2. npm にログイン次に、GitHub Package Registry にログインするために以下のコマンドを実行します： npm login --registry=https://npm.pkg.github.com 実行すると、以下の情報が要求されます： Username: GitHub のユーザー名 Password: 先ほど生成したパーソナルアクセストークン Email: GitHub に登録しているメールアドレス 3. パッケージを再度公開認証が完了したら、再度 npm publish を実行してパッケージを公開します。 npm publish --registry=https://npm.pkg.github.com 結果、以下のようなページが作成されました。利用利用するリポジトリにおいて、以下を作成します。vercelでのビルドを想定して、GITHUB_TOKENは環境変数から参照するようにしました。 @nakamura196:registry=https://npm.pkg.github.com //npm.pkg.github.com/:_authToken=${GITHUB_TOKEN} まとめ結局上記の方法では、もともとの課題であったsetFilterの修正はうまくいかなかったのですが、パッケージの公開方法に関して、参考になる部分がありましたら幸いです。 ...

2024年10月7日 · 更新: 2024年10月7日 · 1 分 · Nakamura

rico-converterを試す

概要 rico-converterを試す機会がありましたので、備忘録です。 https://github.com/ArchivesNationalesFR/rico-converter 以下のように説明されています。 A tool to convert EAC-CPF and EAD 2002 XML files to RDF datasets conforming to Records in Contexts Ontology (RiC-O) （機会翻訳）EAC-CPFやEAD 2002のXMLファイルを、Records in Contexts Ontology（RiC-O）に準拠したRDFデータセットに変換するためのツール変換する以下に説明があります。 https://archivesnationalesfr.github.io/rico-converter/en/GettingStarted.html まず、以下から最新のzipファイルをダウンロードして、展開します。 https://github.com/ArchivesNationalesFR/rico-converter/releases/latest サンプルデータとして、input-eacとinput-eadがあるので、これらをRDFに変換します。 input-eac 本フォルダ内にあるinput-eac/FRAN_NP_051151.xmlについて、ChatGPTによる解説は以下です。 ! このXMLファイルは、EAC-CPF（Encoded Archival Context for Corporate Bodies, Persons, and Families）という形式で記述されており、フランスの文化省（Ministère de la Culture）などの組織に関する情報を体系的に整理したものです。主な要素は次の通りです： • Controlセクション: レコードに関するメタデータを含んでいます。これには、レコードID、言語の宣言、更新履歴、使用した資料の出典などが含まれます。 • Identityセクション: 法人（ここではフランスの文化省）に関する基本的な情報が記載されています。文化省の名前が変遷してきたことを示す複数の名称や、各名称が使われていた期間も詳細に記述されています。 • Descriptionセクション: 文化省の存在期間、法的地位、主要な機能、任務、さらにその歴史的な変遷に関する詳細な説明が含まれています。たとえば、1959年に設立され、法令に基づいてどのように運営されてきたか、組織の構造がどのように変化したかが説明されています。 • Relationsセクション: 他の組織や人物との関係を示しています。ここには、フランス国内の関連機関や教育機関、さらには歴代の文化相との関係が詳細に記載されており、外部のリソース（アーカイブやオンラインの情報）へのリンクも含まれています。このXMLは、アーカイブや組織に関する情報を標準化された形式で記述し、他のデータベースやシステムと連携することを目的としています。特に、組織の歴史や関係性を追跡する際に有用です。以下を実行します。 sh ricoconverter.sh デフォルトのまま実行すると、output-eac-20241005といったフォルダが作成され、その中にRDFファイルが出力されます。 input-ead 本フォルダ内にあるinput-ead/FRAN_IR_003500.xmlについて、ChatGPTによる解説は以下です。 ...

2024年10月5日 · 更新: 2024年10月5日 · 3 分 · Nakamura

Hugging Face SpacesとYOLOv5モデル（顔コレデータセットで学習済み）を使った推論アプリの構築

概要 Hugging Face Spacesと、YOLOv5モデル（顔コレデータセットで学習済み）を使った推論アプリを作成しました。人文学オープンデータ共同利用センターが公開する顔コレデータセットは以下です。 Yingtao Tian, Chikahiko Suzuki, Tarin Clanuwat, Mikel Bober-Irizar, Alex Lamb, Asanobu Kitamoto, “KaoKore: A Pre-modern Japanese Art Facial Expression Dataset”, arXiv:2002.08595. http://codh.rois.ac.jp/face/dataset/ 推論アプリは、以下のURLからお試しいただけます。 https://huggingface.co/spaces/nakamura196/yolov5-face また以下のURLからソースコードや学習済みモデルをダウンロードすることができます。同様のアプリケーションを開発される際の参考になれば幸いです。 https://huggingface.co/spaces/nakamura196/yolov5-face/tree/main なお、アプリケーションの開発には、以下のSpaceを参考にしています。 https://huggingface.co/spaces/pytorch/YOLOv5 使い方画像をアップロードするか、Examplesから画像を選択してご利用いただけます。以下のように認識結果を確認することができます。『文正草子』（日本古典籍データセット（国文研所蔵）CODH配信）まとめ検出精度が十分でない点がありますが、参考になりましたら幸いです。現在はYOLOv5を用いたモデルですが、今後より新しいモデルも試してみたいと思います。

2024年10月5日 · 更新: 2024年10月5日 · 1 分 · Nakamura