diff --git a/README.adoc b/README.adoc
index ea253ed..f9f6fef 100644
--- a/README.adoc
+++ b/README.adoc
@@ -2,9 +2,9 @@
 
 ## Bibliographic data processing library for Java
 
-image::https://api.travis-ci.org/xbib/marc.svg[Build status]
-image::https://img.shields.io/sonar/http/nemo.sonarqube.com/org.xbib:marc/coverage.svg?style=flat-square[Coverage]
-image::https://maven-badges.herokuapp.com/maven-central/org.xbib/marc/badge.svg[Maven Central]
+image:https://api.travis-ci.org/xbib/marc.svg[Build status]
+image:https://img.shields.io/sonar/http/nemo.sonarqube.com/org.xbib:marc/coverage.svg?style=flat-square[Coverage]
+image:https://maven-badges.herokuapp.com/maven-central/org.xbib/marc/badge.svg[Maven Central]
 
 This is a Java library for processing bibliographic data in the following formats:
 
@@ -40,7 +40,8 @@ part of this package.
 
 Here is a code example for reading from an ISO 2709 stream and writing into a MarcXchange collection.
 
-```
+[source,java]
+----
 try (MarcXchangeWriter writer = new MarcXchangeWriter(out)) {
     Marc.builder()
             .setInputStream(in)
@@ -49,11 +50,12 @@ try (MarcXchangeWriter writer = new MarcXchangeWriter(out)) {
             .build()
             .writeCollection();
 }
-```
+----
 
 Here is an example to create MODS from an ISO 2709 stream
 
-```
+[source,java]
+----
 Marc marc = Marc.builder()
         .setInputStream(marcInputStream)
         .setCharset(Charset.forName("ANSEL"))
@@ -63,13 +65,13 @@ StringWriter sw = new StringWriter();
 Result result = new StreamResult(sw);
 System.setProperty("http.agent", "Java Agent");
 marc.transform(new URL("http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl"), result);
-
-```
+----
 
 And here is an example showing how records in "Aleph Sequential") can be parsed and written into
 a MarcXchange collection:
 
-```
+[source,java]
+----
 try (MarcXchangeWriter writer = new MarcXchangeWriter(out, true)
         .setFormat(MarcXchangeConstants.MARCXCHANGE_FORMAT)) {
     Marc marc = Marc.builder()
@@ -79,7 +81,51 @@ try (MarcXchangeWriter writer = new MarcXchangeWriter(out, true)
             .build();
     marc.wrapIntoCollection(marc.aleph());
 }
-```
+----
+
+Another example, writing compressed Elasticsearch bulk format JSON from an ANSEL MARC input stream:
+
+[source,java]
+----
+MarcValueTransformers marcValueTransformers = new MarcValueTransformers();
+// normalize ANSEL diacritics
+marcValueTransformers.setMarcValueTransformer(value -> Normalizer.normalize(value, Normalizer.Form.NFC));
+// split at 10000 records, select Elasticsearch bulk format, set buffer size 65536, gzip compress = true
+try (MarcJsonWriter writer = new MarcJsonWriter("bulk%d.jsonl.gz", 10000,
+        MarcJsonWriter.Style.ELASTICSEARCH_BULK, 65536, true)
+        .setIndex("testindex", "testtype")) {
+    writer.setMarcValueTransformers(marcValueTransformers);
+    Marc.builder()
+            .setFormat(MarcXchangeConstants.MARCXCHANGE_FORMAT)
+            .setType(MarcXchangeConstants.BIBLIOGRAPHIC_TYPE)
+            .setInputStream(in)
+            .setCharset(Charset.forName("ANSEL"))
+            .setMarcListener(writer)
+            .build()
+            .writeCollection();
+
+}
+
+----
+
+where the result can be indexed by a simple bash script using `curl`, because our JSON
+format is compatible to Elasticsearch JSON (which is a key/value format serializable JSON).
+
+[source,bash]
+----
+#!/usr/bin/env bash
+# This example file sends compressed JSON lines formatted files to Elasticsearch bulk endpoint
+# It assumes the index settings and the mappings are already created and configured.
+
+for f in bulk*.jsonl.gz; do
+  curl -XPOST -H "Accept-Encoding: gzip" -H "Content-Encoding: gzip" \
+    --data-binary @$f --compressed localhost:9200/_bulk
+done
+----
+
+The result is a very basic MARC field based index, which is cumbersome to configure, search and analyze.
+In upcoming projects, I will show how to turn MARC into semantic data with context,
+and indexing such data makes much more sense and is also more fun.
 
 ## Bibliographic character sets
 
@@ -99,7 +145,7 @@ it is recommended to use http://github.com/xbib/bibliographic-character-sets if
 You can use the library with Gradle
 
 ```
-    "org.xbib:marc:1.0.2"
+    "org.xbib:marc:1.0.3"
 ```
 
 or with Maven
@@ -108,7 +154,7 @@ or with Maven
     <dependency>
       <groupId>org.xbib</groupId>
       <artifactId>marc</artifactId>
-      <version>1.0.2</version>
+      <version>1.0.3</version>
     </dependency>
 ```
 
@@ -118,6 +164,8 @@ TODO
 
 ## Issues
 
+The XSLT transformation is broken in Java 8u102. Please use Java 8u92.
+
 All contributions are welcome. If you find bugs, want to comment, or send a pull request,
 just open an issue at https://github.com/xbib/marc/issues
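
Note on the value transformer added in this diff: the lambda `value -> Normalizer.normalize(value, Normalizer.Form.NFC)` relies only on the JDK class `java.text.Normalizer`. The following is a minimal standalone sketch of what that call does to a decomposed string, such as a base letter followed by a combining diacritic. The class name `NfcDemo`, the helper `toNfc`, and the sample string are hypothetical names chosen for illustration and are not part of the xbib MARC library.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of org.xbib:marc.
final class NfcDemo {

    // Compose base letters and combining marks into precomposed characters (Unicode NFC).
    static String toNfc(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'o' followed by COMBINING DIAERESIS (U+0308): five UTF-16 code units in total
        String decomposed = "Bo\u0308ll";
        String composed = toNfc(decomposed);
        System.out.println(decomposed.length() + " -> " + composed.length()); // prints "5 -> 4"
        System.out.println(composed.equals("B\u00f6ll")); // prints "true"
    }
}
```

Normalizing to NFC before writing JSON means that the same name indexes to the same byte sequence whether the source record stored the diacritic as a separate combining character or as a precomposed letter.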