# MARC bibliographic data processing library for Java

image::https://api.travis-ci.org/xbib/marc.svg[Build status]

image::https://maven-badges.herokuapp.com/maven-central/org.xbib/marc/badge.svg[Maven Central]

This is a Java library for processing bibliographic data in the following formats:

- ISO 2709/Z39.2
- MARC (USMARC, MARC 21, MARC XML)
- MarcXchange (ISO 25577:2013)
- UNIMARC
- MAB (MAB2, MAB XML)
- dialects of MARC (Aleph Sequential, Pica, SISIS format)

The motivation for this library is to transport bibliographic data into XML- or JSON-based formats,
with a focus on European/German application environments.

The best-known and most widespread bibliographic data format is MARC, which stands for "machine-readable cataloging"
and was developed by the Library of Congress in 1968. Inspired by the success of MARC, several other formats, mostly based
on MARC, were developed in the 1970s, some very similar, some with significant differences. Most notable
is the UNIMARC format, developed by IFLA.

MARC does not offer the features of XML or JSON; it is neither a document format
nor a format for the Web. MARC is stream-based and was designed for writing records to magnetic tape.
Today, magnetic tape distributions are history. File distribution via FTP, common in the 1990s, also does not fit
well into a highly linked information infrastructure like the Semantic Web.
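
To make "stream-based" concrete, here is a minimal sketch of how one ISO 2709 record can be split into fields using only the JDK. It is an illustration, not this library's parser. The class name `Iso2709Sketch` and the sample record are made up, and the sketch assumes the MARC 21 directory layout (12-byte entries: 3-byte tag, 4-byte length, 5-byte start offset) and UTF-8 field data.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative ISO 2709 reader; a sketch, not part of this library's API. */
final class Iso2709Sketch {

    static final char SUBFIELD_DELIMITER = '\u001f';
    static final char FIELD_TERMINATOR = '\u001e';
    static final char RECORD_TERMINATOR = '\u001d';

    /** Splits one complete record into "tag=data" strings via leader and directory. */
    static List<String> fields(byte[] record) {
        // The leader is 24 bytes long; positions 12-16 hold the base address of the data.
        String leader = new String(record, 0, 24, StandardCharsets.US_ASCII);
        int base = Integer.parseInt(leader.substring(12, 17));
        List<String> result = new ArrayList<>();
        // The directory follows the leader and is closed by a field terminator.
        for (int pos = 24; record[pos] != FIELD_TERMINATOR; pos += 12) {
            String entry = new String(record, pos, 12, StandardCharsets.US_ASCII);
            String tag = entry.substring(0, 3);
            int length = Integer.parseInt(entry.substring(3, 7));
            int start = Integer.parseInt(entry.substring(7, 12));
            // The stored length includes the trailing field terminator, so drop it.
            String data = new String(record, base + start, length - 1, StandardCharsets.UTF_8);
            result.add(tag + "=" + data);
        }
        return result;
    }

    /** Hand-built sample record with a control field 001 and a data field 245. */
    static byte[] sample() {
        String leader = "00066nam a2200049 a 4500";
        String directory = "001000600000" + "245001000006" + FIELD_TERMINATOR;
        String data = "12345" + FIELD_TERMINATOR
                + "10" + SUBFIELD_DELIMITER + "aHello" + FIELD_TERMINATOR
                + RECORD_TERMINATOR;
        return (leader + directory + data).getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        for (String field : fields(sample())) {
            // Make the subfield delimiter visible when printing.
            System.out.println(field.replace(SUBFIELD_DELIMITER, '$'));
        }
    }
}
```

Everything is addressed by byte offsets and terminated by control characters, which is why such data needs a dedicated reader before it can be turned into XML or JSON.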

This library offers the first step to transport MARC data into systems in use today,
by writing MARC fields to XML or JSON formats. More steps would include the generation of
graph structures (RDF triples) by processing records in context, but that is not part of this package.

The library provides a fluent interface and a rich set of input streams, content handlers and listeners.
Provided are writers for XML, stylesheet transformations (MODS), and a JSON writer for
key/value-oriented JSON, suitable for indexing into Elasticsearch.
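
The listener idea can be sketched with plain JDK classes. The following is a hypothetical, stripped-down illustration: the `FieldListener` interface and the `XmlListener` class are invented for this example and are not the library's types. It only shows how field events could be serialized in the MARCXML element layout (`record`, `datafield`, `subfield`) with `javax.xml.stream`.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Illustrative only: a hand-rolled listener that writes field events as MARCXML. */
final class MarcXmlSketch {

    static final String NS = "http://www.loc.gov/MARC21/slim";

    /** Hypothetical event interface; the library's own listener types may differ. */
    interface FieldListener {
        void beginRecord() throws XMLStreamException;
        void dataField(String tag, String indicators, String... codeValuePairs) throws XMLStreamException;
        void endRecord() throws XMLStreamException;
    }

    static final class XmlListener implements FieldListener {
        private final XMLStreamWriter xml;

        XmlListener(XMLStreamWriter xml) {
            this.xml = xml;
        }

        public void beginRecord() throws XMLStreamException {
            xml.writeStartElement("record");
            xml.writeDefaultNamespace(NS);
        }

        public void dataField(String tag, String indicators, String... pairs) throws XMLStreamException {
            xml.writeStartElement("datafield");
            xml.writeAttribute("tag", tag);
            xml.writeAttribute("ind1", indicators.substring(0, 1));
            xml.writeAttribute("ind2", indicators.substring(1, 2));
            // pairs alternate between subfield code and subfield value
            for (int i = 0; i + 1 < pairs.length; i += 2) {
                xml.writeStartElement("subfield");
                xml.writeAttribute("code", pairs[i]);
                xml.writeCharacters(pairs[i + 1]);
                xml.writeEndElement();
            }
            xml.writeEndElement();
        }

        public void endRecord() throws XMLStreamException {
            xml.writeEndElement();
        }
    }

    /** Feeds one made-up field through the listener and returns the XML text. */
    static String demo() {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            FieldListener listener = new XmlListener(xml);
            listener.beginRecord();
            listener.dataField("245", "10", "a", "Hello");
            listener.endRecord();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Separating the event source from the listener is what allows the same parsed input to feed an XML writer, a JSON writer, or a stylesheet transformation.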

Here is a code example for reading from an ISO 2709 stream and writing into a MarcXchange collection.

```
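// `in` is an InputStream of ISO 2709 records, `out` is the target of the MarcXchange output.
try (MarcXchangeWriter writer = new MarcXchangeWriter(out)) {
    Marc.builder()
            .setInputStream(in)
            // "ANSEL" is not a JDK charset; a bibliographic character set library must be on the class path
            .setCharset(Charset.forName("ANSEL"))
            // the writer receives the MARC events
            .setMarcListener(writer)
            .build()
            .writeCollection();
}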
```

Here is an example that creates MODS from an ISO 2709 stream:

```
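Marc marc = Marc.builder()
        .setInputStream(marcInputStream)
        .setCharset(Charset.forName("ANSEL"))
        .setSchema(MARC21_FORMAT)
        .build();
StringWriter sw = new StringWriter();
Result result = new StreamResult(sw);
// The "http.agent" system property sets the User-Agent header used when fetching the stylesheet URL.
System.setProperty("http.agent", "Java Agent");
// Apply the Library of Congress MARC21slim-to-MODS stylesheet; the MODS XML ends up in sw.
marc.transform(new URL("http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl"), result);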
```

And here is an example showing how records in "Aleph Sequential" format can be parsed
and written into a MarcXchange collection:

```
try (MarcXchangeWriter writer = new MarcXchangeWriter(out, true)
        .setFormat(MarcXchangeConstants.MARCXCHANGE_FORMAT)) {
    Marc marc = Marc.builder()
            .setInputStream(in)
            .setCharset(StandardCharsets.UTF_8)
            .setMarcListener(writer)
            .build();
    // Parse the input as Aleph Sequential and wrap the records into a MarcXchange collection.
    marc.wrapIntoCollection(marc.aleph());
}
```
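
For readers who have not seen Aleph Sequential, the following sketch shows how a single line might be taken apart by hand. It assumes the commonly documented column layout (9-character system number, blank, 3-character tag, two indicators, blank, alphabet flag `L`, blank, then content with `$$` introducing each subfield). The class `AlephLineSketch` and the sample line are made up, and the library's own parser may handle more variants.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for one Aleph Sequential line; not part of this library. */
final class AlephLineSketch {

    static final class Field {
        final String systemNumber;
        final String tag;
        final String indicators;
        /** Each entry is {code, value}; control fields use an empty code. */
        final List<String[]> subfields = new ArrayList<>();

        Field(String systemNumber, String tag, String indicators) {
            this.systemNumber = systemNumber;
            this.tag = tag;
            this.indicators = indicators;
        }
    }

    static Field parse(String line) {
        // Assumed columns: 0-8 system number, 10-12 tag, 13-14 indicators, 18.. content
        Field field = new Field(line.substring(0, 9), line.substring(10, 13), line.substring(13, 15));
        String content = line.substring(18);
        if (!content.startsWith("$$")) {
            // No subfield marker: treat the whole content as a control field value.
            field.subfields.add(new String[] {"", content});
            return field;
        }
        for (String part : content.substring(2).split("\\$\\$")) {
            if (!part.isEmpty()) {
                // The first character is the subfield code, the rest is the value.
                field.subfields.add(new String[] {part.substring(0, 1), part.substring(1)});
            }
        }
        return field;
    }

    public static void main(String[] args) {
        Field field = parse("000000001 24510 L $$aHello$$bWorld");
        System.out.println(field.systemNumber + " " + field.tag + " " + field.indicators);
        for (String[] subfield : field.subfields) {
            System.out.println(subfield[0] + " = " + subfield[1]);
        }
    }
}
```

Because every line carries the system number, consecutive lines with the same number belong to one record, which is what makes it possible to regroup the fields into records.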

## Bibliographic character sets

Bibliographic character sets predate the era of Unicode. Before Unicode, character sets were
scattered across several standards, and bibliographic standards were defined on top of several
bibliographic character sets. Since Unicode, the UTF-8 encoding has been accepted as
the de facto standard, which fits well with XML and JSON, but processing input data that was
created using bibliographic standards still requires handling of ancient and exotic
encodings.

Because the Java JDK does not provide bibliographic character sets from before the Unicode era,
it must be extended by a bibliographic character set library.
It is recommended to use http://github.com/xbib/bibliographic-character-sets if the input data is encoded in ANSEL/Z39.47 or ISO 5426.
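
Whether such a character set is actually available at runtime can be checked with the standard `Charset` API before any data is read. The sketch below assumes that the extension library registers its character sets through the JDK's `CharsetProvider` mechanism, so that they become visible to `Charset.isSupported`. The class name `CharsetLookupSketch` is made up for this example.

```java
import java.nio.charset.Charset;
import java.util.Optional;

/** Illustrative helper: looks up a charset without throwing if it is missing. */
final class CharsetLookupSketch {

    /** Returns the charset if the JVM, including installed providers, knows the name. */
    static Optional<Charset> find(String name) {
        return Charset.isSupported(name)
                ? Optional.of(Charset.forName(name))
                : Optional.empty();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "ANSEL"}) {
            System.out.println(name + " available: " + find(name).isPresent());
        }
    }
}
```

Failing early with a clear message when the lookup is empty is usually preferable to falling back to another encoding, because decoding ANSEL data with the wrong character set silently corrupts diacritics.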

## Usage

You can use the library with Gradle

```
    "org.xbib:marc:1.0.0"
```

or with Maven

```
   <dependency>
     <groupId>org.xbib</groupId>
     <artifactId>marc</artifactId>
     <version>1.0.0</version>
   </dependency>
```

## MARC4J

This project was inspired by MARC4J, but it is not related to MARC4J and does not reuse its
source code. It is a completely new implementation. For the curious, I tried to
compile a feature comparison table to highlight some differences.

There is a MARC4J fork at https://github.com/ksclarke/freelib-marc4j where Kevin S. Clarke
brings modern Java features to the MARC4J code base.

I am not experienced with MARC4J, so I appreciate any hints, comments, or corrections.

.Feature comparison to MARC4J
|===
| |MARC4J |MARC

|Started by
|Bas Peters
|Jörg Prante

|Project start
|2001
|2016

|Java
|Java 5
|Java 8

|Build
|Ant
|Gradle

|Supported formats
| ISO 2709/Z39.2,
  MARC (USMARC, MARC 21, MARC XML),
  tries to parse MARC-like formats with a "permissive" parser
| ISO 2709/Z39.2,
  MARC (USMARC, MARC 21, MARC XML),
  MarcXchange (ISO 25577:2013),
  UNIMARC,
  MAB (MAB2, MAB XML),
  dialects of MARC (Aleph Sequential, Pica, SISIS format)

| Bibliographic character set support
| builtin, auto-detectable
| dynamically, via Java `Charset` API, no autodetection

| Processing
| iterator-based
| iterator-based, iterable-based, Java 8 streams for fields, records

| Transformations
|
| on-the-fly, pattern-based filtering for tags/values, field key mapping, field value transformations

| Cleaning
|
| substitute invalid characters with a pattern replacement input stream

| Statistics
|
| can count tag/indicator/subfield combination occurrences

| Concurrency support
|
| can write to handlers record by record, provides a `MarcRecordAdapter` to turn MARC field events into record events

| JUnit test coverage
|
| extensive testing over all MARC dialects, >80% code coverage

| Source Quality Profile
|
| https://sonarqube.com/overview?id=1109967[Sonarqube]

| Jar size
| 447 KB (2.7.0)
| 142 KB (1.0.0)

|License
|LGPL
|Apache

|===
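
To illustrate the "Statistics" and "Java 8 streams" rows of the table, here is a minimal sketch of counting tag/indicator/subfield combinations with the stream API. The `Field` class is a hypothetical stand-in and not a class of this library; the key format is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Illustrative only: Field is a hypothetical stand-in, not a class of this library. */
final class FieldStatsSketch {

    static final class Field {
        final String tag;
        final String indicators;
        final String subfieldCodes;

        Field(String tag, String indicators, String subfieldCodes) {
            this.tag = tag;
            this.indicators = indicators;
            this.subfieldCodes = subfieldCodes;
        }

        /** Builds a key such as "245$10$ab"; blank indicators are shown as '_'. */
        String key() {
            return tag + "$" + indicators.replace(' ', '_') + "$" + subfieldCodes;
        }
    }

    /** Counts how often each tag/indicator/subfield combination occurs, sorted by key. */
    static Map<String, Long> count(List<Field> fields) {
        return fields.stream()
                .collect(Collectors.groupingBy(Field::key, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Field> fields = Arrays.asList(
                new Field("245", "10", "ab"),
                new Field("100", "1 ", "a"),
                new Field("245", "10", "ab"));
        System.out.println(count(fields));
    }
}
```

Such counts are a quick way to see which fields a data source really uses before writing mappings for them.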

## License

Copyright (C) 2016 Jörg Prante

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.