Tuesday, November 13, 2012

Timing Matrix Multiplication in SciDB and Setting the Number of Worker Instances in SciDB and Running Matrix Multiplication Piecemeal

Summary:  I am multiplying 2 matrices in SciDB.  Previously I recorded that the calculation ran in 5 hours, but now I am observing / estimating that it takes 33 hours.  This post describes my investigation and my attempt to speed up the calculation by reconfiguring SciDB to use more processors, and ultimately by running the calculation piecemeal so I could monitor its progress.
Update:  Using a system with 4 worker instances instead of 1 decreased the time by approximately a factor of 3.
Further Update:  Adding the specifier 'dense' to the multiply command increased the speed further by a factor of 1.5.

Background

I have 2 matrices in SciDB that I want to multiply:
particleStem_3 is 873637 x 42315
eigVect_3 is 42315 x 100

schema:
[("particleStem_3<count:double> [stem=0:873636,20,0,particle=0:42314,42315,0]")]
[("eigVect_3<value:double> [particle=0:42314,42315,0,eig=0:99,20,0]")]
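For a sense of scale, the dimensions and chunk lengths in these schemas can be turned into chunk counts and chunk sizes. The following is only a rough sketch in Python. It assumes each dimension spec reads name=low:high,chunk_length,overlap, that the arrays are fully dense, and that a double takes 8 bytes; the helper name chunk_stats is made up for illustration and is not part of SciDB.

```python
import math

def chunk_stats(dims, bytes_per_cell=8):
    """dims: list of (low, high, chunk_length), one tuple per dimension.

    Returns (total_cells, total_chunks, dense_bytes_per_chunk).
    Assumes a dense array of 8-byte doubles (an assumption, not a SciDB guarantee).
    """
    cells = 1
    chunks = 1
    cells_per_chunk = 1
    for low, high, chunk_length in dims:
        length = high - low + 1
        cells *= length
        chunks *= math.ceil(length / chunk_length)
        cells_per_chunk *= chunk_length
    return cells, chunks, cells_per_chunk * bytes_per_cell

# Values copied from the two schemas above
particle_stem = chunk_stats([(0, 873636, 20), (0, 42314, 42315)])
eig_vect = chunk_stats([(0, 42314, 42315), (0, 99, 20)])

# Multiply-add count for (873637 x 42315) times (42315 x 100), treated as dense
multiply_adds = 873637 * 42315 * 100

print("particleStem_3 (cells, chunks, bytes per chunk):", particle_stem)
print("eigVect_3      (cells, chunks, bytes per chunk):", eig_vect)
print("multiply-adds:", multiply_adds)
```

Under those assumptions, particleStem_3 splits into many small row bands of 20 stems each, while eigVect_3 splits into only a handful of column bands.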

Monday, November 12, 2012

Summary of loading data and fitting regression in R

I mainly learned about R from this post about doing linear regression in R:
http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/R/

Of course the R-manual is great:
http://cran.r-project.org/doc/manuals/r-release/R-intro.html

However, R is a huge system, and when you are getting started the entire manual can be daunting.  That is why I prefer to begin with a tutorial and then look up more functionality in the manual.

Here are some tips and tricks that I commonly use:

Tuesday, November 6, 2012

How I investigated and shrank Oracle XE

Or how I learned to stop worrying and log in as sysdba to reset the system password

Oracle XE has a limit of 11 GB on data, which I was close to using.  However, the entire Oracle XE system was taking up 52 GB on disk.  Here is how I investigated why it was so large and how I shrank it.

I started with Google and didn't find anything obvious.  I then went to the web-based dashboard / control for Oracle XE:
localhost:8080/apex

and attempted to log in as system.  I was informed that the password had expired.  Google revealed that I should log in as sysdba to reset the password, and that I could do so by:

  1. running sqlplus.exe as Administrator
  2. entering the following for the username:
    • /as sysdba
  3. running the statement:
    • alter user system identified by newpass

I then logged into the dashboard / control and checked under the "storage" tab.  This listed the tablespaces in the system.

Saturday, November 3, 2012

Exporting from OracleXE

Here are 3 things I did to export from OracleXE with various levels of success:


  • Use SQLDeveloper to run a query, then save the results
    • pretty fast but failed for results above a certain size
  • Use SQL*Plus to run the query, spool the results to a file
    • works pretty well, but has to display all of the results on the screen, which slows it down
    • ~5% CPU usage
    • 58 minutes for 15.7 million rows
  • Use a quick custom-written Java application to run the query and print the results, redirecting the output to a file
    • fast and efficient
    • ~10% CPU usage
    • 15 minutes
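
To put the two timings side by side, here is a small back-of-the-envelope calculation in Python. It assumes the 15-minute Java run exported the same 15.7 million rows as the SQL*Plus run, which the list above implies but does not state outright; the function name rows_per_second is just for illustration.

```python
ROWS = 15_700_000  # row count reported for the SQL*Plus run

def rows_per_second(rows, minutes):
    """Average export throughput in rows per second."""
    return rows / (minutes * 60)

sqlplus_rate = rows_per_second(ROWS, 58)
# Assumption: the Java application exported the same number of rows
java_rate = rows_per_second(ROWS, 15)
speedup = 58 / 15

print(f"SQL*Plus: {sqlplus_rate:,.0f} rows/s")
print(f"Java:     {java_rate:,.0f} rows/s")
print(f"Java was about {speedup:.1f}x faster")
```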

Thursday, November 1, 2012

Matrix Multiplication in SciDB - Chunk size requirement and approximate speed comparison to OracleXE

Having loaded some data into SciDB, I now get to the heart of the matter:  matrix multiplication.  Here is what I have for matrix dimensions:

Matrix A:  870,000 x 42,000
Matrix B:  42,000 x 100

I want to do
C = A x B

C will be
870,000 x 100

SciDB makes this easy:
AFL% multiply(A, B)

will produce the result, but we really want to store it rather than display it.  So instead of running the query interactively from within iquery, issue it from the shell with options that suppress the output:

scidb@ubuntu:~$  iquery -naq "store ( multiply (A, B), C)"

(much easier than the SQL equivalent involving insert ... select with joins and aggregations)
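
For comparison, here is a minimal sketch of what that SQL approach can look like. It uses Python's built-in sqlite3 module as a stand-in for Oracle XE, with tiny 2 x 3 and 3 x 2 matrices. The table layout (one row per matrix cell, with columns i, k, j and val) is an assumption made for illustration and is not the schema I actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each matrix is stored one cell per row: (row index, column index, value)
cur.execute("CREATE TABLE A (i INTEGER, k INTEGER, val REAL)")
cur.execute("CREATE TABLE B (k INTEGER, j INTEGER, val REAL)")
cur.execute("CREATE TABLE C (i INTEGER, j INTEGER, val REAL)")

a = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
b = [[7, 8],
     [9, 10],
     [11, 12]]           # 3 x 2

cur.executemany("INSERT INTO A VALUES (?, ?, ?)",
                [(i, k, v) for i, row in enumerate(a) for k, v in enumerate(row)])
cur.executemany("INSERT INTO B VALUES (?, ?, ?)",
                [(k, j, v) for k, row in enumerate(b) for j, v in enumerate(row)])

# C[i][j] = sum over k of A[i][k] * B[k][j]:
# join on the shared index k, then aggregate per (i, j)
cur.execute("""
    INSERT INTO C (i, j, val)
    SELECT A.i, B.j, SUM(A.val * B.val)
    FROM A JOIN B ON A.k = B.k
    GROUP BY A.i, B.j
""")

result = cur.execute("SELECT i, j, val FROM C ORDER BY i, j").fetchall()
print(result)
conn.close()
```

The join pairs every cell of A with the cells of B that share the inner index, and the GROUP BY sums those products into one output cell, which is the work that multiply(A, B) hides.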

***BUT THIS DIDN'T WORK***