Wednesday, October 31, 2012

Migrating from OracleXE to SciDB - the recommended way

This post follows from a previous post describing my initial data migration efforts:
http://dllahr.blogspot.com/2012/10/migrating-data-from-oraclexe-to-scidb.html

Thanks to Paul on the SciDB forum for showing me another way to load the data without having to write my own script to generate the SciDB-formatted data files.  This method is probably much safer: if the SciDB format changes, I don't need to worry about updating my script.
http://www.scidb.org/forum/viewtopic.php?f=11&t=598
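
The forum thread itself is not reproduced here, so the following is only a minimal sketch under an assumption: that the approach amounts to exporting the data as plain CSV and letting SciDB's own tooling (as far as I know, SciDB ships a csv2scidb converter) produce the load format. The function name, column names, and sample rows are all hypothetical; in practice the rows would come from a query against OracleXE.

```python
import csv
import io


def write_rows_as_csv(rows, column_names, out):
    """Write an iterable of row tuples to `out` as CSV, header line first.

    Returns the number of data rows written.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(column_names)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical sample data standing in for a database query result:
# one (row index, column index, value) triple per matrix cell.
sample_rows = [(0, 0, 1.5), (0, 1, 2.25), (1, 0, -3.0)]

buffer = io.StringIO()  # a real export would open a file instead
written = write_rows_as_csv(sample_rows, ["row_idx", "col_idx", "val"], buffer)
```

The point of stopping at CSV is that nothing in the export depends on SciDB's file format, which is what makes this route less fragile than a hand-written formatter.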

The recommended way

Monday, October 29, 2012

Migrating data from OracleXE to SciDB (outdated)

Edit:  Read about the easier, recommended method here: http://dllahr.blogspot.com/2012/10/migrating-from-oraclexe-to-scidb.html 


Migrating from OracleXE - choosing a new database (SciDB vs. MySQL)

I hit the 11 GB limit on OracleXE for my home project, so I needed to find a new database.  I quickly settled on either using MySQL or SciDB.

MySQL is another relational database, but it is open source, and there is a community edition available for free without a size limit*

SciDB is a new type of database designed for "big data" and doing advanced mathematics on that data.

Here are the basic pros and cons I came up with for each, based on my situation:

MySQL:

  • pros:
    • well established
    • easy to install
    • easy (probably) to switch my Hibernate code to use it instead of XE
    • would learn about a major Oracle alternative (haven't used it in ~5 years)
  • cons:
    • probably at best as fast as OracleXE for linear algebra operations, possibly slower
    • not sure how / if possible to run on multiple processors
SciDB:
  • pros:
    • designed / optimized for linear algebra
    • learn something completely new
    • designed to be scalable and run calculations in parallel
  • cons:
    • experimental / not well established
    • not sure if it works with Hibernate

Next step:  timebox SciDB investigation

I decided to set aside 4 hours of work to see if I could get SciDB up and running.  My guess was that if it took longer than that, I would continue to have chronic problems using it.  In the meantime, as a backup, I installed MySQL on my system.

System:  Dell laptop running Win7.  8 GB RAM.  Pentium with 8 "threads" (~= cores).  I have VMware Player with an installation of Ubuntu 11.10.
SciDB runs on Linux, so all my work with SciDB was done on the Ubuntu installation running in VMware Player.

I attempted to install SciDB v12.3 from binaries and could not resolve all the dependencies.  They are specified for Ubuntu 11.04 and I am running 11.10, so that may have been the issue.

Installing from source code worked fine: it was relatively easy and fast (beat the timebox by a lot).  I followed the instructions in the manual.  There was a minor problem with the config file in the documentation, as explained in this forum post:
http://www.scidb.org/forum/viewtopic.php?f=11&t=506&p=828&hilit=803#p828

You need to create a user called "scidb" with sudo privileges, and then log in as the scidb user whenever you execute commands.

I also wrote it up on the scidb forum:

Decision:  SciDB

At this point I decided to go with SciDB.  It passed the install timebox test very well and I decided I would rather learn about the newer SciDB than MySQL.

Edit: Probably does not work with Hibernate

It appears that SciDB currently provides connectors only for Python and C/C++.  My plan therefore is to do a partial migration: the large, matrix-based data will be migrated to SciDB, while the relational information will stay in OracleXE.  It probably would make sense to migrate the relational information to the Postgres instance that is associated with the SciDB system, but I'm going to focus on just the first part for now.