ETL - Extractors
Extractor components are the first part of the ETL process and are responsible for the extracting of data.
Available Extractors
row | jdbc | json | csv | xml |
---|---|---|---|---|
row
Extracts content row by row.
- Component name: row
- Output class: [String]
Syntax
Parameter | Description | Type | Mandatory | Default value |
---|---|---|---|---|
multiLine | Supports multi line. This is useful with CSV supporting linefeed inside strings. As of 2.0.9 | boolean | false | true |
lineFeed | Linefeed to use in case of multiline (see above). Since 2.0.9 | string | false | \r\n |
Example with default configuration
{ "row": {} }
csv (as of v2.1.4)
Extract content from csv files. Apache Commons-csv is used to parse csv files. This component is avaliable as of version 2.1.4
- Component name. csv
- Output class: [ODocument]
Syntax
Parameter | Description | Type | Mandatory | Default value |
---|---|---|---|---|
separator | Column separator | char | false | , |
columnsOnFirstLine | Columns are described in the first line | boolean | false | true |
columns | Columns array containing names, and optionally types by postfixing names with: |
string[] | false | - |
nullValue | value to consider as NULL | string | false | NULL |
dateFormat | date format to use for parsing dates | string | false | yyy-mm-dd |
quote | String character delimiter | char | false | " |
skipFrom | Line number where start to skip | integer | false | - |
skipTo | Line number where skip ends | integer | false | - |
ignoreEmptyLines | Ignore empry lines | boolean | false | false |
predefinedFormat | Name of standard csv format (from Apache commons-csv): DEFAULT, EXCEL, MYSQL, RFC4180, TDF | string | false | - |
Documentation about commons-csv predefined format is available here: (https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html)
Examples
Extract lines from CSV (as ODocument), using comma as separator, considering "NULL" as null value and skipping the rows 2-4:
{ "csv":
{ "separator": ",",
"nullValue": "NULL",
"skipFrom": 1,
"skipTo": 3
}
}
Extract lines from a CSV exported from MYSQL:
{ "csv":
{ "predefinedFormat": "MYSQL"}
}
Extract lines from a CSV with default format using 'N/A' as null value placeholder and custom date format:
{ "csv":
{ "predefinedFormat": "DEFAULT",
"nullValue" : "N/A",
"dateFormat" : "dd-mm-yyyy HH:MM"
}
}
jdbc
Extracts data from any DBMS that support JDBC driver. In order to get the ETL component to connect to the source database, put the DBMS's JDBC driver in the classpath or $ORIENTDB_HOME/lib directory.
- Component name: jdbc
- Output class: [ODocument]
Syntax
Parameter | Description | Type | Mandatory | Default value |
---|---|---|---|---|
driver | JDBC Driver class | string | true | - |
url | JDBC URL to connect | string | true | - |
userName | DBMS User name | string | true | - |
userPassword | DBMS User password | string | true | - |
query | Query that extract the record to import | string | true | - |
queryCount | Query that return the count of the fetched records. This is used to provide a correct progress indicator | string | false | - |
Example
Extracts all the "Client" table from a MySQL database "test" hosted on localhost:
{ "jdbc": {
"driver": "com.mysql.jdbc.Driver",
"url": "jdbc:mysql://localhost/test",
"userName": "root",
"userPassword": "",
"query": "select * from Client"
}
}
json
Extracts data by parsing json objects. If the data has more json items, they must be enclosed between [].
- Component name: json
- Output class: [ODocument]
Example
{ "json": {} }
xml (as of v2.2)
Extracts the data by parsing XML.
- Component name: xml
- Output class: [ODocument]
Syntax
Parameter | Description | Type | Mandatory | Default value |
---|---|---|---|---|
rootNode | Root node to consider. By default it build a document starting from the root tag | string | false | empty |
tagsAsAttribute | array of tags where children tags are considered as attributes of document and the attribute value is the text inside the tag | string[] | false | empty |
Examples
Example 1: extract data from an XML file.
simple.xml
XML file content:
<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>
<c name='Ferrari' color='red'>ignore</c>
<c name='Maserati' color='black'/>
</b>
</a>
OrientDB ETL configuration file:
{"source": { "file": { "path": "src/test/resources/simple.xml" } }, "extractor" : { "xml": {} }, "loader": { "test": {} } }
Result:
{
"a": {
"b": {
"c": [
{
"color": "red",
"name": "Ferrari"
},
{
"color": "black",
"name": "Maserati"
}
]
}
}
}
Example 2: extract a collection from XML.
simple.xml
XML file content:
<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<CD>
<TITLE>Hide your heart</TITLE>
<ARTIST>Bonnie Tyler</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>CBS Records</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1988</YEAR>
</CD>
<CD>
<TITLE>Greatest Hits</TITLE>
<ARTIST>Dolly Parton</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>RCA</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1982</YEAR>
</CD>
</CATALOG>
OrientDB ETL configuration file:
{"source": { "file": { "path": "src/test/resources/music.xml" } }, "extractor" : { "xml": { "rootNode": "CATALOG.CD", "tagsAsAttribute": ["CATALOG.CD"] } }, "loader": { "test": {} } }
Result:
{
"TITLE": "Empire Burlesque",
"ARTIST": "Bob Dylan",
"COUNTRY": "USA",
"COMPANY": "Columbia",
"PRICE": "10.90",
"YEAR": "1985"
}
{
"TITLE": "Hide your heart",
"ARTIST": "Bonnie Tyler",
"COUNTRY": "UK",
"COMPANY": "CBS Records",
"PRICE": "9.90",
"YEAR": "1988"
}
{
"TITLE": "Greatest Hits",
"ARTIST": "Dolly Parton",
"COUNTRY": "USA",
"COMPANY": "RCA",
"PRICE": "9.90",
"YEAR": "1982"
}