ETL
The OrientDB-ETL module moves data into and out of OrientDB by executing an ETL (Extract, Transform, Load) process. OrientDB ETL is based on the following principles:
- one configuration file in JSON format
- exactly one Extractor, which extracts data from a source
- exactly one Loader, which loads data into a destination
- multiple Transformers that transform data in a pipeline. Each Transformer receives an input, processes it, and returns an output that the next component in the pipeline takes as its input
How ETL works
EXTRACTOR => TRANSFORMERS[] => LOADER
An example of a process that extracts from a CSV file, applies some changes, does a lookup to see whether the record has already been created, and then stores the record as a document in an OrientDB database:
+-----------+-----------------------+-----------+
|           |              PIPELINE             |
+ EXTRACTOR +-----------------------+-----------+
|           |     TRANSFORMERS      |  LOADER   |
+-----------+-----------------------+-----------+
|   FILE   ==>  CSV->FIELD->MERGE  ==> OrientDB |
+-----------+-----------------------+-----------+
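The process in the diagram could be written as a configuration file along the following lines. This is a sketch rather than a tested configuration: the file path, database URL, class name, field names, and the uppercase expression are hypothetical, and the component option names (such as joinFieldName, lookup, and dbType) are given as recalled from the 2.0-era format, so check them against the component reference for your version.

```json
{
  "source": { "file": { "path": "/temp/datasets/people.csv" } },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": { "separator": "," } },
    { "field": { "fieldName": "name", "expression": "name.toUpperCase()" } },
    { "merge": { "joinFieldName": "id", "lookup": "Person.id" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/temp/databases/people",
      "dbType": "document",
      "class": "Person",
      "indexes": [
        { "class": "Person", "fields": ["id:string"], "type": "UNIQUE" }
      ]
    }
  }
}
```

The order of the entries in "transformers" is the order of the pipeline: each row is parsed as CSV, then the field is changed, then the merge step looks up an existing record before the Loader stores the result.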
The pipeline, composed of the transformation and loading phases, can run in parallel by setting {"parallel":true} in the configuration.
Installation
Starting from OrientDB v2.0, the ETL module is bundled with the official release. To build and install the module from source, follow these steps:
- Clone the repository on your computer, by executing:
git clone https://github.com/orientechnologies/orientdb-etl.git
- Compile the module, by executing:
mvn clean install
- Copy script/oetl.sh (or .bat under Windows) to $ORIENTDB_HOME/bin
- Copy target/orientdb-etl-2.0-SNAPSHOT.jar to $ORIENTDB_HOME/lib
Usage
$ cd $ORIENTDB_HOME/bin
$ ./oetl.sh config-dbpedia.json
NOTE: If you are importing data for use in a distributed database, you must set ridBag.embeddedToSbtreeBonsaiThreshold=Integer.MAX_VALUE for the ETL process to avoid replication errors when the database is updated online.
Run-time configuration
In an ETL JSON file you can define variables that are resolved at run-time from values passed at startup. For example, you could set the database URL to ${databaseURL} and then pass the actual URL at execution time with:
$ ./oetl.sh config-dbpedia.json -databaseURL=plocal:/temp/mydb
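For the command above to have any effect, the configuration file has to reference the variable somewhere. A minimal fragment might look like the following. It assumes the URL is set through the dbURL option of an orientdb loader, which the text above does not state explicitly.

```json
{
  "loader": {
    "orientdb": {
      "dbURL": "${databaseURL}"
    }
  }
}
```

With -databaseURL=plocal:/temp/mydb on the command line, ${databaseURL} would resolve to plocal:/temp/mydb when the process starts.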
Available Components
Examples: