add example for Elasticsearch JSON generation to the README
This commit is contained in:
parent 1df7b07410
commit 957209c99e

1 changed file with 60 additions and 12 deletions: README.adoc

## Bibliographic data processing library for Java

image:https://api.travis-ci.org/xbib/marc.svg[Build status]
image:https://img.shields.io/sonar/http/nemo.sonarqube.com/org.xbib:marc/coverage.svg?style=flat-square[Coverage]
image:https://maven-badges.herokuapp.com/maven-central/org.xbib/marc/badge.svg[Maven Central]

This is a Java library for processing bibliographic data in the following formats:

[...]

Here is a code example for reading from an ISO 2709 stream and writing into a MarcXchange collection.

[source,java]
----
try (MarcXchangeWriter writer = new MarcXchangeWriter(out)) {
    Marc.builder()
            .setInputStream(in)
            // ...
            .build()
            .writeCollection();
}
----
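
The diff elides two builder calls in the middle of this example. Judging from the other
examples in this README, they presumably set the input charset and attach the writer as
MARC listener; here is a completed sketch under that assumption (not verbatim from the file):

[source,java]
----
// assumed completion of the elided builder calls, mirroring the later examples
try (MarcXchangeWriter writer = new MarcXchangeWriter(out)) {
    Marc.builder()
            .setInputStream(in)
            .setCharset(Charset.forName("ANSEL"))  // assumed input encoding
            .setMarcListener(writer)               // route parsed records to the writer
            .build()
            .writeCollection();
}
----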

Here is an example of creating MODS from an ISO 2709 stream:

[source,java]
----
Marc marc = Marc.builder()
        .setInputStream(marcInputStream)
        .setCharset(Charset.forName("ANSEL"))
        // ...
StringWriter sw = new StringWriter();
Result result = new StreamResult(sw);
// set a custom user agent; some servers reject Java's default agent string
System.setProperty("http.agent", "Java Agent");
marc.transform(new URL("http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl"), result);
----
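
The transformed MODS document ends up in the `StringWriter`; reading it back is a trivial
follow-up, not part of the README itself:

[source,java]
----
// the XSLT output was written into the StringWriter via the StreamResult
String mods = sw.toString();
----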

And here is an example showing how records in "Aleph Sequential" format can be parsed
and written into a MarcXchange collection:

[source,java]
----
try (MarcXchangeWriter writer = new MarcXchangeWriter(out, true)
        .setFormat(MarcXchangeConstants.MARCXCHANGE_FORMAT)) {
    Marc marc = Marc.builder()
            // ...
            .build();
    marc.wrapIntoCollection(marc.aleph());
}
----
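
Here, too, the middle builder calls are elided by the diff; presumably they set the input
stream, charset, and listener as elsewhere. A completed sketch under that assumption:

[source,java]
----
// assumed completion; input stream and encoding are illustrative, not from the README
try (MarcXchangeWriter writer = new MarcXchangeWriter(out, true)
        .setFormat(MarcXchangeConstants.MARCXCHANGE_FORMAT)) {
    Marc marc = Marc.builder()
            .setInputStream(in)                  // Aleph Sequential input
            .setCharset(StandardCharsets.UTF_8)  // assumed encoding
            .setMarcListener(writer)
            .build();
    marc.wrapIntoCollection(marc.aleph());
}
----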

Another example, writing compressed Elasticsearch bulk format JSON from an ANSEL MARC input stream:

[source,java]
----
MarcValueTransformers marcValueTransformers = new MarcValueTransformers();
// normalize ANSEL diacritics
marcValueTransformers.setMarcValueTransformer(value -> Normalizer.normalize(value, Normalizer.Form.NFC));
// split at 10000 records, select Elasticsearch bulk format, set buffer size 65536, gzip compress = true
try (MarcJsonWriter writer = new MarcJsonWriter("bulk%d.jsonl.gz", 10000,
        MarcJsonWriter.Style.ELASTICSEARCH_BULK, 65536, true)
        .setIndex("testindex", "testtype")) {
    writer.setMarcValueTransformers(marcValueTransformers);
    Marc.builder()
            .setFormat(MarcXchangeConstants.MARCXCHANGE_FORMAT)
            .setType(MarcXchangeConstants.BIBLIOGRAPHIC_TYPE)
            .setInputStream(in)
            .setCharset(Charset.forName("ANSEL"))
            .setMarcListener(writer)
            .build()
            .writeCollection();
}
----
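
Each record in the generated files is a pair of JSON lines in the Elasticsearch bulk
format: an action line naming index and type, followed by the document source. Roughly
like this sketch (the action line follows the documented bulk format; the MARC field
rendering is purely illustrative, not the writer's exact output):

[source,json]
----
{"index":{"_index":"testindex","_type":"testtype"}}
{"001":"123456","245":{"a":"An example title"}}
----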

The result can then be indexed by a simple bash script using `curl`, because our JSON
format is compatible with Elasticsearch's JSON (a plain key/value serializable format):

[source,bash]
----
#!/usr/bin/env bash
# This example sends compressed JSON lines files to the Elasticsearch bulk endpoint.
# It assumes the index settings and mappings have already been created and configured.
for f in bulk*.jsonl.gz; do
    curl -XPOST -H "Accept-Encoding: gzip" -H "Content-Encoding: gzip" \
        --data-binary @$f --compressed localhost:9200/_bulk
done
----
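
A caveat not in the original README: Elasticsearch 6.x and later rejects bulk requests
without an explicit content type, so on newer clusters the `curl` call needs one more header:

[source,bash]
----
# same call as above, with the content type header required by Elasticsearch 6.x and later
curl -XPOST -H "Accept-Encoding: gzip" -H "Content-Encoding: gzip" \
    -H "Content-Type: application/x-ndjson" \
    --data-binary @$f --compressed localhost:9200/_bulk
----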

The result is a very basic MARC-field-based index, which is cumbersome to configure, search, and analyze.
In upcoming projects, I will show how to turn MARC into semantic data with context;
indexing such data makes much more sense and is also more fun.

## Bibliographic character sets

[...] it is recommended to use http://github.com/xbib/bibliographic-character-sets if [...]

You can use the library with Gradle

```
"org.xbib:marc:1.0.3"
```

or with Maven

```
<dependency>
    <groupId>org.xbib</groupId>
    <artifactId>marc</artifactId>
    <version>1.0.3</version>
</dependency>
```

[...]

## Issues

The XSLT transformation is broken in Java 8u102. Please use Java 8u92.

All contributions are welcome. If you find bugs, want to comment, or send a pull request,
just open an issue at https://github.com/xbib/marc/issues