# xbib MARC

## Bibliographic data processing library for Java

image::https://api.travis-ci.org/xbib/marc.svg[Build status]
image::https://img.shields.io/sonar/http/nemo.sonarqube.com/org.xbib:marc/coverage.svg?style=flat-square[Coverage]
image::https://maven-badges.herokuapp.com/maven-central/org.xbib/marc/badge.svg[Maven Central]

This is a Java library for processing bibliographic data in the following formats:

- ISO 2709/Z39.2
- MARC (USMARC, MARC 21, MARC XML)
- MarcXchange (ISO 25577:2013)
- UNIMARC
- MAB (MAB2, MAB XML)
- dialects of MARC (Aleph Sequential, Pica, SISIS format)

The motivation of this library is to transport bibliographic data into XML- or JSON-based formats,
with a focus on European/German application environments.

The best-known and most widespread bibliographic data format is MARC, which stands for "machine readable cataloging"
and was developed by the Library of Congress in 1968. Inspired by the success of MARC, several other formats, mostly based
on MARC, were developed in the 1970s, some very similar, some with significant differences. Most notable
is the UNIMARC format, developed by IFLA.

MARC does not offer the features of XML or JSON; it is not a document format
or a format for the Web. MARC is stream-based "structured data", composed of fields in sequential order,
and was designed for writing records to magnetic tape.
Today, magnetic tape data distribution is history, and file distribution via FTP, common in the 1990s,
does not fit well into a highly linked and sophisticated information infrastructure like the Semantic Web.

This library offers the first step in the complex procedure of moving MARC data into today's computer applications,
by writing MARC fields to XML or JSON formats. Further steps would include the generation of
graph structures (RDF triples) by processing MARC records in context, but that is not part of this package.

The library provides a fluent interface and a rich set of input streams, content handlers, and listeners.
Writers are provided for XML, for stylesheet transformations (MODS), and for
key/value-oriented JSON suitable for indexing into Elasticsearch. Indexing into Elasticsearch itself is not
part of this package.

Here is a code example for reading from an ISO 2709 stream and writing into a MarcXchange collection:

```
try (MarcXchangeWriter writer = new MarcXchangeWriter(out)) {
    Marc.builder()
            .setInputStream(in)
            .setCharset(Charset.forName("ANSEL"))
            .setMarcListener(writer)
            .build()
            .writeCollection();
}
```
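The key/value JSON writer mentioned above can be plugged in the same way as the MarcXchange writer.
The following is only a sketch: the `MarcJsonWriter` class name and its constructor are assumptions
made by analogy with `MarcXchangeWriter`, so check the `org.xbib.marc.json` package of the version you use.

```
// sketch, not a verified API: assumes a MarcJsonWriter that, like MarcXchangeWriter,
// acts as a MARC listener, is Closeable, and writes key/value JSON to the given output stream
try (MarcJsonWriter writer = new MarcJsonWriter(out)) {
    Marc.builder()
            .setInputStream(in)
            .setCharset(Charset.forName("ANSEL"))
            .setMarcListener(writer)
            .build()
            .writeCollection();
}
```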
Here is an example of creating MODS from an ISO 2709 stream:

```
Marc marc = Marc.builder()
        .setInputStream(marcInputStream)
        .setCharset(Charset.forName("ANSEL"))
        .setSchema(MARC21_FORMAT)
        .build();
StringWriter sw = new StringWriter();
Result result = new StreamResult(sw);
System.setProperty("http.agent", "Java Agent");
marc.transform(new URL("http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl"), result);
```

And here is an example showing how records in "Aleph Sequential" can be parsed
and written into a MarcXchange collection:

```
try (MarcXchangeWriter writer = new MarcXchangeWriter(out, true)
        .setFormat(MarcXchangeConstants.MARCXCHANGE_FORMAT)) {
    Marc marc = Marc.builder()
            .setInputStream(in)
            .setCharset(StandardCharsets.UTF_8)
            .setMarcListener(writer)
            .build();
    marc.wrapIntoCollection(marc.aleph());
}
```

## Bibliographic character sets

Bibliographic character sets predate the era of Unicode. Before Unicode, character sets were
scattered across several standards, and bibliographic standards were defined on top of several
bibliographic character sets. Since Unicode, UTF-8 has been accepted as the de-facto standard
encoding, which fits well with XML and JSON, but processing input data that was created using
the older bibliographic standards still requires handling of ancient and exotic encodings.

Because the Java JDK does not provide the bibliographic character sets from before the Unicode era,
it must be extended by a bibliographic character set library.
It is recommended to use http://github.com/xbib/bibliographic-character-sets if the input data is encoded in ANSEL/Z39.47 or ISO 5426.
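As a sketch, once such a charset provider is on the classpath, the encoding can be resolved through
the standard Java `Charset` API and passed to the builder as in the examples above ("ANSEL" is the
charset name already used in this README; `in` and `writer` stand for your input stream and listener):

```
// sketch: resolving the pre-Unicode ANSEL/Z39.47 encoding by name; this only works
// when a bibliographic charset provider is on the classpath
Charset ansel = Charset.forName("ANSEL");
Marc.builder()
        .setInputStream(in)
        .setCharset(ansel)
        .setMarcListener(writer)
        .build()
        .writeCollection();
```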
## Usage

You can use the library with Gradle

```
"org.xbib:marc:1.0.1"
```

or with Maven

```
<dependency>
    <groupId>org.xbib</groupId>
    <artifactId>marc</artifactId>
    <version>1.0.1</version>
</dependency>
```

## Documentation

TODO

## Issues

All contributions are welcome. If you find bugs, want to comment, or send a pull request,
just open an issue at https://github.com/xbib/marc/issues
## MARC4J

This project was inspired by MARC4J, but it is not related to MARC4J and does not reuse its
source code. It is a completely new implementation.

There is a MARC4J fork at https://github.com/ksclarke/freelib-marc4j where Kevin S. Clarke
implements modern Java features in the MARC4J code base.

For the curious, I tried to compile a feature comparison table to highlight some differences.
I am not very familiar with MARC4J, so I appreciate any hints, comments, or corrections.

.Feature comparison of MARC4J to xbib MARC
|===
| |MARC4J | xbib MARC

|Started by
|Bas Peters
|Jörg Prante

|Project start
|2001
|2016

|Java
|Java 5
|Java 8

|Build
|Ant
|Gradle

|Supported formats
| ISO 2709/Z39.2,
MARC (USMARC, MARC 21, MARC XML),
tries to parse MARC-like formats with a "permissive" parser
| ISO 2709/Z39.2,
MARC (USMARC, MARC 21, MARC XML),
MarcXchange (ISO 25577:2013),
UNIMARC,
MAB (MAB2, MAB XML),
dialects of MARC (Aleph Sequential, Pica, SISIS format)

| Bibliographic character set support
| built-in, auto-detectable
| dynamic, via the Java `Charset` API, no autodetection

| Processing
| iterator-based
| iterator-based, iterable-based, Java 8 streams for fields and records

| Transformations
|
| on-the-fly, pattern-based filtering for tags/values, field key mapping, field value transformations

| Cleaning
|
| substitution of invalid characters with a pattern replacement input stream

| Statistics
|
| can count tag/indicator/subfield combination occurrences

| Concurrency support
|
| can write to handlers record by record, provides a `MarcRecordAdapter` to turn MARC field events into record events

| JUnit test coverage
|
| extensive testing over all MARC dialects, >80% code coverage

| Source quality profile
|
| https://sonarqube.com/overview?id=1109967[Sonarqube]

| Jar size
| 447 KB (2.7.0)
| 142 KB (1.0.0)

|License
|LGPL
|Apache
|===
# License

Copyright (C) 2016 Jörg Prante

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.