Overview

The XML Schema Definition (XSD) files for the JPCOAR Schema are published in the following repository. Many thanks to everyone who created the schema and made it available.

https://github.com/JPCOAR/schema

This article is a memo on validating XML files against the above schema. This is my first attempt at this kind of validation, so some of the terminology or details may be inaccurate. My apologies in advance.

A Google Colab notebook is also available.

https://colab.research.google.com/github/nakamura196/ndl_ocr/blob/main/JPCOARスキーマを用いたxmlファイルのバリデーション.ipynb

Preparation

Clone the repository

cd /content/
git clone https://github.com/JPCOAR/schema.git

Install the library

pip install xsd-validator

Load the XSD file (v1)

from xsd_validator import XsdValidator
validator = XsdValidator('/content/schema/1.0/jpcoar_scm.xsd')

Trying v1

OK Example

<?xml version="1.0" ?>
<jpcoar:jpcoar
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:jpcoar="https://github.com/JPCOAR/schema/blob/master/1.0/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://github.com/JPCOAR/schema/blob/master/1.0/jpcoar_scm.xsd">
	<dc:title>JPCOARスキーマを用いたxmlファイルのバリデーション</dc:title>
	<dc:type rdf:resource="http://purl.org/coar/resource_type/c_6501">article</dc:type>
</jpcoar:jpcoar>

Save the XML above as /content/ok.xml and validate it:

validator.assert_valid("/content/ok.xml")

# No errors
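
The step of saving the XML to a file is not shown above. A minimal sketch of one way to do it from a notebook cell, using only the Python standard library, might look like the following. The helper name save_xml and the temporary directory are assumptions for illustration (in Colab the target would be /content/ok.xml). The parse at the end only confirms that the file is well-formed XML; it does not check it against the schema.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Same content as the OK example in this article.
OK_XML = """<?xml version="1.0" ?>
<jpcoar:jpcoar
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:jpcoar="https://github.com/JPCOAR/schema/blob/master/1.0/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://github.com/JPCOAR/schema/blob/master/1.0/jpcoar_scm.xsd">
  <dc:title>JPCOARスキーマを用いたxmlファイルのバリデーション</dc:title>
  <dc:type rdf:resource="http://purl.org/coar/resource_type/c_6501">article</dc:type>
</jpcoar:jpcoar>
"""


def save_xml(text, path):
    """Write text to path as UTF-8 and return the root tag (hypothetical helper)."""
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    # Re-reading the file catches well-formedness problems early.
    # This is NOT schema validation.
    return ET.parse(path).getroot().tag


# A temporary directory is used here so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root_tag = save_xml(OK_XML, Path(tmp) / "ok.xml")
    print(root_tag)
```

The returned tag is in the {namespace}localname form that ElementTree uses, which makes it easy to confirm that the jpcoar prefix is bound to the intended namespace before running the XSD validator.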

NG Example

The following file fails validation, apparently because jpcoar:subject is placed after dc:type.

<?xml version="1.0" ?>
<jpcoar:jpcoar
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:jpcoar="https://github.com/JPCOAR/schema/blob/master/1.0/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://github.com/JPCOAR/schema/blob/master/1.0/jpcoar_scm.xsd">
	<dc:title>JPCOARスキーマを用いたxmlファイルのバリデーション</dc:title>
	<dc:type rdf:resource="http://purl.org/coar/resource_type/c_6501">article</dc:type>
	<jpcoar:subject subjectScheme="Other">テスト</jpcoar:subject>
</jpcoar:jpcoar>

Save the XML above as /content/ng.xml and validate it:

validator.assert_valid("/content/ng.xml")

XsdValidationErrorWithInfo: /content/ng.xml: line 9 column 41: cvc-complex-type.2.4.a: Invalid content was found starting with element '{"https://github.com/JPCOAR/schema/blob/master/1.0/":subject}'. One of '{"https://schema.datacite.org/meta/kernel-4/":version, "http://namespace.openaire.eu/schema/oaire/":version, "https://github.com/JPCOAR/schema/blob/master/1.0/":identifier, "https://github.com/JPCOAR/schema/blob/master/1.0/":identifierRegistration, "https://github.com/JPCOAR/schema/blob/master/1.0/":relation, "http://purl.org/dc/terms/":temporal, "https://schema.datacite.org/meta/kernel-4/":geoLocation, "https://github.com/JPCOAR/schema/blob/master/1.0/":fundingReference, "https://github.com/JPCOAR/schema/blob/master/1.0/":sourceIdentifier, "https://github.com/JPCOAR/schema/blob/master/1.0/":sourceTitle, "https://github.com/JPCOAR/schema/blob/master/1.0/":volume, "https://github.com/JPCOAR/schema/blob/master/1.0/":issue, "https://github.com/JPCOAR/schema/blob/master/1.0/":numPages, "https://github.com/JPCOAR/schema/blob/master/1.0/":pageStart, "https://github.com/JPCOAR/schema/blob/master/1.0/":pageEnd, "http://ndl.go.jp/dcndl/terms/":dissertationNumber, "http://ndl.go.jp/dcndl/terms/":degreeName, "http://ndl.go.jp/dcndl/terms/":dateGranted, "https://github.com/JPCOAR/schema/blob/master/1.0/":degreeGrantor, "https://github.com/JPCOAR/schema/blob/master/1.0/":conference, "https://github.com/JPCOAR/schema/blob/master/1.0/":file}' is expected.
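
A message this long is hard to read at a glance. As a rough sketch, the offending element and the list of expected elements can be pulled out with a regular expression. The function name split_message and the shortened SAMPLE text are assumptions for illustration. The pattern relies only on the "namespace":name shape visible in the message above, not on any documented format.

```python
import re

# Shortened sample in the same shape as the message above (not the full text).
SAMPLE = (
    "cvc-complex-type.2.4.a: Invalid content was found starting with element "
    "'{\"https://github.com/JPCOAR/schema/blob/master/1.0/\":subject}'. One of "
    "'{\"https://schema.datacite.org/meta/kernel-4/\":version, "
    "\"https://github.com/JPCOAR/schema/blob/master/1.0/\":identifier}' is expected."
)

# Matches "namespace":localName pairs.
QNAME = re.compile(r'"([^"]+)":([\w.-]+)')


def split_message(message):
    """Return (offending, expected) as lists of (namespace, local_name) tuples."""
    # Normalize typographic double quotes that can appear when a message
    # has been pasted through a blog editor.
    for curly in "“”":
        message = message.replace(curly, '"')
    head, _, tail = message.partition("One of")
    return QNAME.findall(head), QNAME.findall(tail)


offending, expected = split_message(SAMPLE)
print("offending:", [name for _, name in offending])
print("expected: ", [name for _, name in expected])
```

Read this way, the message says that once dc:type has appeared, only the listed elements may follow, and jpcoar:subject is not among them.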

Fix

Move jpcoar:subject before dc:type, so that dc:type comes after it.

<?xml version="1.0" ?>
<jpcoar:jpcoar
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:jpcoar="https://github.com/JPCOAR/schema/blob/master/1.0/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://github.com/JPCOAR/schema/blob/master/1.0/jpcoar_scm.xsd">
	<dc:title>JPCOARスキーマを用いたxmlファイルのバリデーション</dc:title>
	<jpcoar:subject subjectScheme="Other">テスト</jpcoar:subject>
	<dc:type rdf:resource="http://purl.org/coar/resource_type/c_6501">article</dc:type>
</jpcoar:jpcoar>

Save the XML above as /content/fix.xml and validate it:

validator.assert_valid("/content/fix.xml")

# No errors
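
That reordering alone fixes the error, which suggests that the schema expects these child elements in a fixed order. A small standard-library sketch of such an order check is shown below. ASSUMED_ORDER is inferred only from the examples in this article (title, then subject, then type) and is not copied from jpcoar_scm.xsd. The names first_out_of_order and build are hypothetical. This is a quick pre-check, not a substitute for the XSD validator.

```python
import xml.etree.ElementTree as ET

JPCOAR = "https://github.com/JPCOAR/schema/blob/master/1.0/"
DC = "http://purl.org/dc/elements/1.1/"

# Partial order inferred from this article's examples only (assumption).
ASSUMED_ORDER = [f"{{{DC}}}title", f"{{{JPCOAR}}}subject", f"{{{DC}}}type"]


def first_out_of_order(xml_text, order=ASSUMED_ORDER):
    """Return the tag of the first child that appears too late, or None."""
    rank = {tag: i for i, tag in enumerate(order)}
    highest = -1
    for child in ET.fromstring(xml_text):
        position = rank.get(child.tag)
        if position is None:
            continue  # element not covered by the partial list
        if position < highest:
            return child.tag
        highest = position
    return None


def build(children):
    """Wrap child element strings in a jpcoar:jpcoar root (illustration only)."""
    return (
        f'<jpcoar:jpcoar xmlns:dc="{DC}" xmlns:jpcoar="{JPCOAR}">'
        + "".join(children)
        + "</jpcoar:jpcoar>"
    )


TITLE = "<dc:title>test</dc:title>"
SUBJECT = '<jpcoar:subject subjectScheme="Other">test</jpcoar:subject>'
TYPE = "<dc:type>article</dc:type>"

print(first_out_of_order(build([TITLE, TYPE, SUBJECT])))  # NG order: reports the subject tag
print(first_out_of_order(build([TITLE, SUBJECT, TYPE])))  # fixed order: None
```

Elements that are not in the partial list are skipped rather than rejected, so the check flags only the ordering problems it knows about.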

Summary

By reading the error message, I was able to fix the XML file: the list of expected elements showed that jpcoar:subject is not allowed after dc:type.

The Google Colab notebook also includes validation examples targeting JPCOAR Schema Version 2.0.

There may be some inaccurate content, but I hope this serves as a helpful reference for XML file validation.