add example for Elasticsearch JSON generation to the README
parent 1df7b07410
commit 957209c99e
1 changed file with 60 additions and 12 deletions

README.adoc
@@ -2,9 +2,9 @@

## Bibliographic data processing library for Java

-image::https://api.travis-ci.org/xbib/marc.svg[Build status]
-image::https://img.shields.io/sonar/http/nemo.sonarqube.com/org.xbib:marc/coverage.svg?style=flat-square[Coverage]
-image::https://maven-badges.herokuapp.com/maven-central/org.xbib/marc/badge.svg[Maven Central]
+image:https://api.travis-ci.org/xbib/marc.svg[Build status]
+image:https://img.shields.io/sonar/http/nemo.sonarqube.com/org.xbib:marc/coverage.svg?style=flat-square[Coverage]
+image:https://maven-badges.herokuapp.com/maven-central/org.xbib/marc/badge.svg[Maven Central]

This is a Java library for processing bibliographic data in the following formats:

@@ -40,7 +40,8 @@ part of this package.

Here is a code example for reading from an ISO 2709 stream and writing into a MarcXchange collection.

-```
+[source,java]
+----
try (MarcXchangeWriter writer = new MarcXchangeWriter(out)) {
    Marc.builder()
            .setInputStream(in)
@@ -49,11 +50,12 @@ try (MarcXchangeWriter writer = new MarcXchangeWriter(out)) {
            .build()
            .writeCollection();
}
-```
+----
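
For context: `in` and `out` are not declared in the snippet above. Here is a minimal sketch of how they might be opened, assuming plain files (the file names are made up for illustration, and `MarcXchangeWriter` is assumed to accept an `OutputStream`):

[source,java]
----
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// hypothetical input: a file of ISO 2709 records
InputStream in = Files.newInputStream(Paths.get("records.mrc"));
// hypothetical output: the MarcXchange XML collection
OutputStream out = Files.newOutputStream(Paths.get("records.xml"));
----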

Here is an example to create MODS from an ISO 2709 stream:

-```
+[source,java]
+----
Marc marc = Marc.builder()
        .setInputStream(marcInputStream)
        .setCharset(Charset.forName("ANSEL"))
@@ -63,13 +65,13 @@ StringWriter sw = new StringWriter();
Result result = new StreamResult(sw);
System.setProperty("http.agent", "Java Agent");
marc.transform(new URL("http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl"), result);
-
-```
+----
+
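
Once `marc.transform(...)` has run, the MODS document sits in the `StringWriter` declared above (see the hunk header); reading it back is a trivial sketch:

[source,java]
----
// the transformed MODS XML as a string
String mods = sw.toString();
----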

And here is an example showing how records in "Aleph Sequential" can be parsed
and written into a MarcXchange collection:

-```
+[source,java]
+----
try (MarcXchangeWriter writer = new MarcXchangeWriter(out, true)
        .setFormat(MarcXchangeConstants.MARCXCHANGE_FORMAT)) {
    Marc marc = Marc.builder()
@@ -79,7 +81,51 @@ try (MarcXchangeWriter writer = new MarcXchangeWriter(out, true)
            .build();
    marc.wrapIntoCollection(marc.aleph());
}
-```
+----

+Another example, writing compressed Elasticsearch bulk format JSON from an ANSEL MARC input stream:
+
+[source,java]
+----
+MarcValueTransformers marcValueTransformers = new MarcValueTransformers();
+// normalize ANSEL diacritics
+marcValueTransformers.setMarcValueTransformer(value -> Normalizer.normalize(value, Normalizer.Form.NFC));
+// split at 10000 records, select Elasticsearch bulk format, set buffer size 65536, gzip compress = true
+try (MarcJsonWriter writer = new MarcJsonWriter("bulk%d.jsonl.gz", 10000,
+        MarcJsonWriter.Style.ELASTICSEARCH_BULK, 65536, true)
+        .setIndex("testindex", "testtype")) {
+    writer.setMarcValueTransformers(marcValueTransformers);
+    Marc.builder()
+            .setFormat(MarcXchangeConstants.MARCXCHANGE_FORMAT)
+            .setType(MarcXchangeConstants.BIBLIOGRAPHIC_TYPE)
+            .setInputStream(in)
+            .setCharset(Charset.forName("ANSEL"))
+            .setMarcListener(writer)
+            .build()
+            .writeCollection();
+}
+----
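
Each record becomes two lines in the generated files, following Elasticsearch's standard bulk syntax: an action line carrying the index metadata set via `setIndex`, then the record itself as key/value JSON. Roughly like this (the action line is standard Elasticsearch; the exact field layout of the record line is whatever `MarcJsonWriter` emits, so treat it as a sketch only):

[source,json]
----
{"index":{"_index":"testindex","_type":"testtype"}}
{"001":"...","245":{"...":"remaining MARC fields as key/value JSON"}}
----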
+
+The result can then be indexed by a simple bash script using `curl`, because our JSON
+format is compatible with Elasticsearch JSON (a key/value serializable JSON format).
+
+[source,bash]
+----
+#!/usr/bin/env bash
+# This example script sends compressed JSON lines formatted files to the Elasticsearch bulk endpoint.
+# It assumes the index settings and the mappings are already created and configured.
+
+for f in bulk*.jsonl.gz; do
+    curl -XPOST -H "Accept-Encoding: gzip" -H "Content-Encoding: gzip" \
+        --data-binary @"$f" --compressed localhost:9200/_bulk
+done
+----
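
Note that newer Elasticsearch versions (6.x and later) reject bulk requests without an explicit content type, so the `curl` call above would additionally need `-H "Content-Type: application/x-ndjson"`.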
+
+The result is a very basic MARC-field-based index, which is cumbersome to configure, search, and analyze.
+In upcoming projects, I will show how to turn MARC into semantic data with context;
+indexing such data makes much more sense and is also more fun.

## Bibliographic character sets

@@ -99,7 +145,7 @@ it is recommended to use http://github.com/xbib/bibliographic-character-sets if
You can use the library with Gradle

```
-"org.xbib:marc:1.0.2"
+"org.xbib:marc:1.0.3"
```

or with Maven
@@ -108,7 +154,7 @@ or with Maven
<dependency>
    <groupId>org.xbib</groupId>
    <artifactId>marc</artifactId>
-    <version>1.0.2</version>
+    <version>1.0.3</version>
</dependency>
```

@@ -118,6 +164,8 @@ TODO

## Issues

+The XSLT transformation is broken in Java 8u102. Please use Java 8u92.
+
All contributions are welcome. If you find bugs, want to comment, or send a pull request,
just open an issue at https://github.com/xbib/marc/issues