Monday, October 29, 2012

Migrating data from OracleXE to SciDB (outdated)

So far I have only read that SciDB has connectors for Python and C/C++.  I posted on the SciDB to find out if there was a connector for Java or any recommended way to migrate data from OracleXE to SciDB, and had a similar response.
http://www.scidb.org/forum/viewtopic.php?f=6&t=593

The route I chose was to export from oracle XE as a text file and then import that into SciDB.  Ultimately it required a conversion step in between.  "Background" and "What Worked" are the short version, "initial attempt / problem" are just provided to prevent others (e.g. me when I forget) from repeating the same mistake I made.
Short version posted on the SciDB forum:
http://www.scidb.org/forum/viewtopic.php?f=11&t=598

Background

SciDB stores data as arrays.  Arrays can have multiple dimensions.  In addition to dimensions, arrays also have attributes.  Each attribute of an array defines a data type.  So for example, in Java you might have:
String[][] stringArray = new String[2][2];  //a 2x2 String array
int[][] intArray = new int[2][2];  //a 2x2 int array

in SciDB you would have:
create array intStringPair <myStr : string, myInt, int64> [i=0:1,2,0, j=0:1,1,0];

The one SciDB array has 2 attributes: myStr and myInt of type string and int64 respectively.  Each location in the array is effectively a string-int pair.

The size of the array - dimensions and size of each dimension - is set in the brackets at the end of the above expression:
[i=0:1,2,0, j=0:2,1,0]

In this case one dimension is called "i" and the other is called "j" (the dimension names are user defined).

0:1 indicates that the dimension has indices 0 or 1 (length = 2)
,2 indicates a chunk size of 2
,0 indicates an overlap of 0

Chunk size is used to describe the contiguous blocks that the array should be stored in.  See the documentation for more details, but basically the above indicates that the chunks are 2 units long in the i dimension and 1 unit long in the j dimension.   If we think of i representing rows and j representing columns, then each chunk represents a column.

For my purposes I don't need intStringPair - I need just doubles - so the equivalent array I use is generated with:
create array eigVect <value : double> [row=0:9,10,0, col=0:3,1,0];

(creates a 10x4 array of doubles)

What worked

Since my arrays are matrices for my current purposes, and the columns of the matrices have meaning to me, I ended up making my chunks as columns of the arrays.  I created some fake data using:

 build(eigVect, row*4 + col);
output:
[[(0)],[(4)],[(8)],[(12)],[(16)],[(20)],[(24)],[(28)],[(32)],[(36)]];[[(1)],[(5)],[(9)],[(13)],[(17)],[(21)],[(25)],
[(29)],[(33)],[(37)]];[[(2)],[(6)],[(10)],[(14)],[(18)],[(22)],[(26)],[(30)],[(34)],[(38)]];[[(3)],[(7)],[(11)],
[(15)],[(19)],[(23)],[(27)],[(31)],[(35)],[(39)]]


(the semicolons define the chunks)

This generates an array containing the same dimensions as eigVect where the values are generated at each location using the dimension formula.  I exported this so I could see how to format it for import:

save( (build(eigVect, row*4 + col), '/home/scidb/test.scidb');

the contents of the file /home/scidb/test.scidb are:
 {0,0}[[(0)],[(4)],[(8)],[(12)],[(16)],[(20)],[(24)],[(28)],[(32)],[(36)]];
{0,1}[[(1)],[(5)],[(9)],[(13)],[(17)],[(21)],[(25)],[(29)],[(33)],[(37)]];
{0,2}[[(2)],[(6)],[(10)],[(14)],[(18)],[(22)],[(26)],[(30)],[(34)],[(38)]];
{0,3}[[(3)],[(7)],[(11)],[(15)],[(19)],[(23)],[(27)],[(31)],[(35)],[(39)]]


The curly brackets indicate the chunk location.  The remaining numbers on each line are the contents of the chunk.

After experimenting, I noticed that the system does not care about line breaks.  So I realized that I could output a single column of data from my OracleXE table and then reformat to produce something acceptable.  Essentially the above could be equally well formatted as:
{0,0}[
[(0)],
[(4)],
[(8)],
[(12)],
[(16)],
[(20)],
[(24)],
[(28)],
[(32)],
[(36)]
];
...

The non-bold, non-blue above can be easily generated by a SQL statement.  Then, a script can can be used to insert the other pieces systematically.

Oracle query:
select '[(' || value || ')]' from eigenvector_value
where eigenvalue_id >= 401 and eigenvalue_id <= 408
order by eigenvalue_id, row_id;

produces:
[(1.1904804E-002)]
[(1.40402524E-003)]
[(4.06114245E-003)]
[(5.2728327E-003)]
...

I ran the above in SQL Developer and then right clicked and chose export, saved as space delimited (since just one entry per line), eliminated enclosing " marks, the header row, and set line termination to be Unix style.

I then processed that file using a bash script that reads in the file line by line, keeps track of the line number, and when appropriate add the
• chunk indexes and opening bracket
• commas at the end of lines
• closing bracket and semicolon
see below for the script.  I then loaded the data using a command issued from the command line:

-n is for no output
-a is for using AFL language (array function language) (as opposed to default AQL, array query language)
-q indicates a query will be provided on the command line instead of using the interactive shell

it ran in under 10s!!!

formatting Script:
 #!/bin/bash

inputFile=$1 vectorLength=$2

let lastRowIndex=$vectorLength-1 let lineNum=0 while read line do let rowIndex=$lineNum%$vectorLength # echo$rowIndex

if [ $rowIndex -eq 0 ] then if [$lineNum -gt 0 ]
then
echo "];"
fi

let columnIndex=$lineNum/$vectorLength
echo column index: $columnIndex > /dev/stderr echo "{0,$columnIndex}["
fi

if [ $rowIndex -lt$lastRowIndex ]
then
echo $line"," else echo$line
fi

let lineNum=lineNum+1
done < \$inputFile

echo "];"


Initial attempt / problem

also under the SciDB forum at:

I initially chose to use a chunk size of 1 for both dimensions.  Then, each line consisted of the chunk indexes and the data value.  This directly matches the structure in the OracleXE and thus can be generated directly from a SQL query:

select '{' || (row_id-33091) || ',' || (eigenvalue_id-501) || '}' || '[[(' || value || ')]];'
from eigenvector_value
where eigenvalue_id >= 501 and eigenvalue_id <=600
order by row_id, eigenvalue_id;

output:
{0,0}[[(-1.90783001E-003)]];
{0,1}[[(-8.21435009E-004)]];
{0,2}[[(2.80230997E-005)]];
{0,3}[[(-8.95769976E-004)]];
{0,4}[[(-2.9486001E-003)]];
...

This did not work.  When I attempted to load, it ran for awhile and eventually crashed when it ran out of space (~2.7 GB).  Note: the initial data was only ~130 MB in the above text format.  So I added a new partition, moved the SciDB over to it (very easy by the way, just copied the "data" folder in the /home/scidb over).  And tried again.  After running for 2.6 hours, it had used 8.5 GB on the partition.  I tried to abort the process using ctrl-C.  This got me back to the command line, however top indicated that the SciDB process was still chugging along.  It took the -19 kill command to stop the process.  I tried to restart SciDB, restart Ubuntu and then restart SciDB, but I couldn't issue commands.  So I re-initialized the SciDB database:
scidb.py initall test1

and then tried what I describe above under "What Worked"
posted under the SciDB forum at: