An add-on for the Apache Solr Data Import Handler.
This project adds an entity processor that handles bibliographic records. It supports the metamorph DSL for data extraction.
mvn package
Produces solr-metamorph-entity-processor-VERSION-jar-with-dependencies.jar in the target directory.
The following steps assume a fresh Solr installation.
- Download the latest version
- Unzip
- The directory of your Solr installation is solr-VERSION (e.g. solr-7.4.0)
A list of Solr directories:
| Name | Location | Example |
|---|---|---|
| SOLR_ROOT | Path to the unpacked Solr distribution | /srv/solr-7.4.0 |
| SOLR_SERVER_DIR | SOLR_ROOT/server | /srv/solr-7.4.0/server |
| SOLR_HOME | SOLR_ROOT/server/solr | /srv/solr-7.4.0/server/solr |
Create the directory SOLR_ROOT/lib:
mkdir -p SOLR_ROOT/lib
mkdir -p SOLR_ROOT/lib/metafacture
Copy all Metafacture Module JARs into SOLR_ROOT/lib/metafacture and into SOLR_ROOT/server/solr-webapp/webapp/WEB-INF/lib.
cd SOLR_ROOT/lib/metafacture
# Set METAFACTURE_VERSION to the Metafacture release to download before running the loop.
repo="https://repo1.maven.org/maven2/org/metafacture"
modules="metafacture-biblio metafacture-commons metafacture-flowcontrol metafacture-framework metafacture-io metafacture-mangling metamorph metamorph-api"
for module in $modules; do
  # Download each module JAR into the current directory (SOLR_ROOT/lib/metafacture).
  wget -q "${repo}/${module}/${METAFACTURE_VERSION}/${module}-${METAFACTURE_VERSION}.jar"
done
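The download loop stores the JARs only in SOLR_ROOT/lib/metafacture, while the step above also asks for a copy in the web application's library directory. A minimal sketch of that second copy, assuming a POSIX shell; copy_jars is a hypothetical helper name, not part of this project:

```shell
# Copy every JAR from a source directory into a target directory.
# Hypothetical helper for illustration; the target is created if it is missing.
copy_jars() {
  src_dir="$1"
  dest_dir="$2"
  mkdir -p "$dest_dir"
  cp "$src_dir"/*.jar "$dest_dir"/
}
```

For this installation the call would be `copy_jars SOLR_ROOT/lib/metafacture SOLR_ROOT/server/solr-webapp/webapp/WEB-INF/lib`, with SOLR_ROOT replaced by the real path.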
mkdir -p SOLR_ROOT/lib/dih
Copy the latest release JAR into SOLR_ROOT/lib/dih.
Note: This assumes an existing core (you may use a default core). Edit the solrconfig.xml of your core.
Enable the Data Import Handler and the processor by adding the following
lib statements to the solrconfig.xml of your config set:
<!-- Data Import Handler -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<!-- Metafacture -->
<lib dir="${solr.install.dir:../../../..}/lib/metafacture" regex="metafacture-.*\.jar" />
<!-- Data Import Handler Add-Ons -->
<lib dir="${solr.install.dir:../../../..}/lib/dih" regex="solr-metamorph-entity-processor-.*\.jar" />
Add the /dataimport request handler to the solrconfig.xml:
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">solr-data-config.xml</str>
</lst>
</requestHandler>
Tip: An example solr-data-config.xml is located in example/solr-data-config.xml.
Note: The test data is located in example/testdata.mrc. The example solr-data-config.xml expects it in /tmp.
The MetamorphEntityProcessor reads all content from the data source record by record. It can handle compressed input streams if the data source is a BinFileDataSource.
Each record is processed by a metafacture pipeline that uses metamorph to extract fields.
The Metamorph Entity Processor has the following attributes:
- url: Required. An attribute that specifies the location of the input file in a way that is compatible with the configured data source.
- format: Required. The format supplied by the data source. Supported formats:
  - marc21: Records are pre-processed by replacing newline and carriage return characters with a space.
  - marcxml: Records are pre-processed by converting marcxml into marc21 and then applying the marc21 pre-processing (see above). If includeFullRecord=true, the implicit field fullRecord contains the MARC21 representation of the record.
- morphDef: Required. The metamorph definition files that are used for field extraction. Each extracted field is added as an implicit field. If the value is a comma-separated list of files, the data is passed from one metamorph file to the next. The files are located inside the config set’s conf directory. Make sure that your metamorph definition XML has the following properties:
  - The encoding of the file should be UTF-8.
    - Validate the file encoding with a text editor.
  - Check for control characters if you use XML 1.0.
    - ASCII control characters other than tab, line feed and carriage return cannot be legally encoded in XML 1.0.
- includeFullRecord: An optional attribute that adds the received record to the implicit field fullRecord. The attribute is a boolean value (true or false) that is false by default.
- onError: By default, the MetamorphEntityProcessor stops processing documents when it finds one that generates an error. If you set onError to "skip", the MetamorphEntityProcessor instead skips documents that fail processing. A debug message is created that contains the record and the cause of the failure.
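The requirements on the metamorph definition files listed under morphDef can also be checked from the command line. The sketch below assumes a POSIX shell with tr, wc and iconv available. It counts the control characters that XML 1.0 forbids and tests whether a file decodes as UTF-8. The function names are hypothetical helpers, not part of this project.

```shell
# Count bytes that XML 1.0 does not allow: every control character below
# 0x20 except tab (011), line feed (012) and carriage return (015).
count_illegal_xml_chars() {
  # tr -c -d deletes everything that is NOT in the set, so only the
  # offending bytes remain and wc -c counts them.
  LC_ALL=C tr -c -d '\000-\010\013\014\016-\037' < "$1" | wc -c | tr -d ' '
}

# Succeeds if the file is valid UTF-8 (iconv fails on invalid byte sequences).
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

A definition file for which count_illegal_xml_chars prints 0 and is_utf8 succeeds meets both requirements and can be referenced in morphDef together with the other attributes.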
For example:
<entity name="morph"
processor="org.culturegraph.solr.handler.dataimport.MetamorphEntityProcessor"
url="path/to/file.marc21"
inputFormat="marc21"
morphDef="morph.xml,morph2.xml"
includeFullRecord="true"
onError="skip">
<field column="identifier" name="id"/>
<field column="fullRecord" name="fullRecord_s"/>
</entity>
The metamorph definitions used:
<?xml version="1.0" encoding="UTF-8"?>
<!-- morph.xml -->
<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
<rules>
<data name="idn" source="001"/>
</rules>
</metamorph>

<?xml version="1.0" encoding="UTF-8"?>
<!-- morph2.xml -->
<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
<rules>
<data name="identifier" source="idn"/>
</rules>
</metamorph>
Run a full import:
curl -s "http://localhost:1111/solr/demo/dataimport?command=full-import"
Check status:
curl -s "http://localhost:1111/solr/demo/dataimport?command=status"
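A full import runs in the background, so the status command is typically called repeatedly until the importer is idle again. A minimal sketch, assuming the response is JSON with a top-level "status" entry whose value is idle or busy, and reusing the base URL from the commands above; dih_state and wait_for_import are hypothetical helper names:

```shell
# Extract the importer state ("idle" or "busy") from a status response.
# Assumption: the response contains an entry of the form "status":"<state>".
# The numeric "status":0 in the response header does not match the pattern.
dih_state() {
  printf '%s' "$1" | grep -o '"status": *"[a-z]*"' | head -n 1 | cut -d'"' -f4
}

# Poll the status command every two seconds while the importer is busy.
# Assumption: the base URL defaults to the one used in this document.
wait_for_import() {
  base_url="${1:-http://localhost:1111/solr/demo}"
  while [ "$(dih_state "$(curl -s "${base_url}/dataimport?command=status&wt=json")")" = "busy" ]; do
    sleep 2
  done
}
```

Calling wait_for_import after the full-import command returns once the importer no longer reports busy, at which point the commit can be sent.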
Commit:
curl -s "http://localhost:1111/solr/demo/update?commit=true"
Note: The admin UI provides a Dataimport screen.
A record processed by metamorph is transformed into an intermediate representation (IR) that consists of the following elements:
- Record
- Entity
- Literal
A row processed by Solr is a map that consists of key-value or key-list pairs.
startRecord("001")
literal("date", "20181001")
startEntity("person")
literal("lastname", "Unknown")
endEntity()
literal("cat", "human")
literal("cat", "person")
endRecord()
{
"cat": ["human", "person"],
"date": "20181001",
"personLastname": "Unknown"
}
The following rules are applied to convert an IR to a row:
- The record id is ignored.
- Literals with the same name form a list.
- Literal names in entities are prefixed with the entity name in CamelCase.
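The three rules can be sketched as a small awk filter. The tab-separated event format read from standard input is a hypothetical stand-in for the IR and not something this project defines. ir_to_row is likewise an illustrative name, and nested entities are not handled.

```shell
# Convert IR events (one per line, tab-separated) into row fields.
# Assumption: events are written as "literal<TAB>name<TAB>value",
# "startEntity<TAB>name", "endEntity", "startRecord<TAB>id" and "endRecord".
ir_to_row() {
  awk -F '\t' '
    # The record id is ignored, so there is no action for startRecord.
    $1 == "startEntity" { entity = $2 }
    $1 == "endEntity"   { entity = "" }
    $1 == "literal" {
      name = $2
      # Inside an entity, prefix the literal name with the entity name.
      if (entity != "")
        name = entity toupper(substr($2, 1, 1)) substr($2, 2)
      # Literals with the same name are collected into one list.
      if (!(name in row)) { order[++n] = name; row[name] = $3 }
      else row[name] = row[name] "," $3
    }
    $1 == "endRecord" {
      # Print the row in first-seen order, then reset for the next record.
      for (i = 1; i <= n; i++) print order[i] "=" row[order[i]]
      for (i = 1; i <= n; i++) delete row[order[i]]
      n = 0
    }
  '
}
```

Fed with the events of the example record above, the filter prints date=20181001, personLastname=Unknown and cat=human,person, one per line.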