add example for Elasticsearch JSON generation to the README

Jörg Prante 2016-09-28 12:03:41 +02:00
parent 1df7b07410
commit 957209c99e


@@ -2,9 +2,9 @@
## Bibliographic data processing library for Java
image:https://api.travis-ci.org/xbib/marc.svg[Build status]
image:https://img.shields.io/sonar/http/nemo.sonarqube.com/org.xbib:marc/coverage.svg?style=flat-square[Coverage]
image:https://maven-badges.herokuapp.com/maven-central/org.xbib/marc/badge.svg[Maven Central]
This is a Java library for processing bibliographic data in the following formats:
@@ -40,7 +40,8 @@ part of this package.
Here is a code example for reading from an ISO 2709 stream and writing into a MarcXchange collection.
[source,java]
----
try (MarcXchangeWriter writer = new MarcXchangeWriter(out)) {
    Marc.builder()
        .setInputStream(in)
@@ -49,11 +50,12 @@ try (MarcXchangeWriter writer = new MarcXchangeWriter(out)) {
        .build()
        .writeCollection();
}
----
Here is an example to create MODS from an ISO 2709 stream
[source,java]
----
Marc marc = Marc.builder()
    .setInputStream(marcInputStream)
    .setCharset(Charset.forName("ANSEL"))
@@ -63,13 +65,13 @@ StringWriter sw = new StringWriter();
Result result = new StreamResult(sw);
System.setProperty("http.agent", "Java Agent");
marc.transform(new URL("http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl"), result);
----
And here is an example showing how records in "Aleph Sequential" format can be parsed
and written into a MarcXchange collection:
[source,java]
----
try (MarcXchangeWriter writer = new MarcXchangeWriter(out, true)
        .setFormat(MarcXchangeConstants.MARCXCHANGE_FORMAT)) {
    Marc marc = Marc.builder()
@@ -79,7 +81,51 @@ try (MarcXchangeWriter writer = new MarcXchangeWriter(out, true)
        .build();
    marc.wrapIntoCollection(marc.aleph());
}
----
Another example, writing compressed Elasticsearch bulk format JSON from an ANSEL MARC input stream:
[source,java]
----
MarcValueTransformers marcValueTransformers = new MarcValueTransformers();
// normalize ANSEL diacritics
marcValueTransformers.setMarcValueTransformer(value -> Normalizer.normalize(value, Normalizer.Form.NFC));
// split at 10000 records, select Elasticsearch bulk format, set buffer size 65536, gzip compress = true
try (MarcJsonWriter writer = new MarcJsonWriter("bulk%d.jsonl.gz", 10000,
        MarcJsonWriter.Style.ELASTICSEARCH_BULK, 65536, true)
        .setIndex("testindex", "testtype")) {
    writer.setMarcValueTransformers(marcValueTransformers);
    Marc.builder()
        .setFormat(MarcXchangeConstants.MARCXCHANGE_FORMAT)
        .setType(MarcXchangeConstants.BIBLIOGRAPHIC_TYPE)
        .setInputStream(in)
        .setCharset(Charset.forName("ANSEL"))
        .setMarcListener(writer)
        .build()
        .writeCollection();
}
----
The result can be indexed by a simple bash script using `curl`, because our JSON
format is compatible with Elasticsearch JSON (a serializable key/value format).
[source,bash]
----
#!/usr/bin/env bash
# This example file sends compressed JSON lines formatted files to Elasticsearch bulk endpoint
# It assumes the index settings and the mappings are already created and configured.
for f in bulk*.jsonl.gz; do
    curl -XPOST -H "Accept-Encoding: gzip" -H "Content-Encoding: gzip" \
        --data-binary @$f --compressed localhost:9200/_bulk
done
----
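For orientation, each action in an Elasticsearch bulk body is a pair of newline-delimited JSON documents: an action/metadata line followed by the record source. A minimal uncompressed sketch of that shape (the field names and `_id` here are illustrative, not the library's actual output):

```json
{"index":{"_index":"testindex","_type":"testtype","_id":"1"}}
{"001":"123456789","245":{"a":"Example title"}}
```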
The result is a very basic MARC field-based index, which is cumbersome to configure, search, and analyze.
In upcoming projects, I will show how to turn MARC into semantic data with context;
indexing such data makes much more sense and is also more fun.
## Bibliographic character sets
@@ -99,7 +145,7 @@ it is recommended to use http://github.com/xbib/bibliographic-character-sets if
You can use the library with Gradle
```
"org.xbib:marc:1.0.3"
```
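In a Gradle build script this coordinate goes into the `dependencies` block, for example (the `compile` configuration name is an assumption matching Gradle versions of that era):

```groovy
dependencies {
    compile "org.xbib:marc:1.0.3"
}
```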
or with Maven
@@ -108,7 +154,7 @@ or with Maven
<dependency>
    <groupId>org.xbib</groupId>
    <artifactId>marc</artifactId>
    <version>1.0.3</version>
</dependency>
```
@@ -118,6 +164,8 @@ TODO
## Issues
The XSLT transformation is broken in Java 8u102. Please use Java 8u92.
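If you want to fail fast on the affected runtime, a minimal sketch (the version-string comparison is an assumption about how the JVM reports itself; it is not part of this library):

```java
public final class XsltRuntimeCheck {

    // Java 8 update releases report versions like "1.8.0_102"
    static boolean isBrokenXsltRuntime(String javaVersion) {
        return javaVersion.startsWith("1.8.0_102");
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        if (isBrokenXsltRuntime(version)) {
            throw new IllegalStateException("XSLT transformation is broken in Java 8u102, use 8u92");
        }
    }
}
```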
All contributions are welcome. If you find bugs, want to comment, or send a pull request,
just open an issue at https://github.com/xbib/marc/issues