Tuesday, November 13, 2012

Timing Matrix Multiplication in SciDB and Setting the Number of Worker Instances in SciDB and Running Matrix Multiplication Piecemeal

 Summary:  I am multiplying 2 matrices in SciDB.  Previously I recorded the calculation ran in 5 hours, but now I am observing / estimating it to run in 33 hours.  This post is a description of my investigation and my attempt to speed up the calculation by reconfiguring SciDB to use more processors, and ultimately running the calculation piecemeal so I could monitor its progress.  
Update:  Using a system with 4 worker instances instead of 1 decreased the time by approximately a factor of 3.
Further Update:  adding the specifier 'dense' to the multiply command increased the speed further by a factor of 1.5

Background

I have 2 matrices in SciDB that I want to multiply:
particleStem_3 is 873637 x 42315
eigVect_3 is 42315 x 100

schema:
[("particleStem_3<count:double> [stem=0:873636,20,0,particle=0:42314,42315,0]")]
[("eigVect_3<value:double> [particle=0:42314,42315,0,eig=0:99,20,0]")]


Note:

  • the chunk size of the inner dimension (particle) matches
  • the number of elements in the chunks of each array are ~1000000 as recommended
    • particleStem_3:  20 * 42315 = 846300
      • (okay, that is some serious handwaving)
    • eigVect_3:  42315 * 20 = 846300
My investigation focused on measuring the time of the calculation as it changed with increasing the number of rows in the left most matrix.  I take a subarray of the left matrix using:
subarray(particleStem_3, 0,0,3,42314)

This returns a matrix which is 4x42315:
  • left most 0 to 3:  4 units of the first dimension
  • 2nd 0 to 42314:  42315 units of the second dimension
I nested this subarray call within the matrix multiply statement, and stored the result.  I timed the whole operation:

time iquery -naq "store(multiply(subarray(articleStem_3, 0,0,0,42314), eigVect_3), prinComp_3)"
Query was executed successfully

real 0m12.158s
user 0m0.028s
sys 0m0.012s

Time to run calculation using 1 instance

System: SciDB 12.03
Ubuntu 11.10 running under VMWare

I repeated this for increasing numbers of rows (removing prinComp_3 between runs, remove(prinComp_3)), to obtain this data:
number of rows time[s]
4 13.654
40 25.9
400 85.702
4000 620.328












































Note the log scale on the x-axis.

Conclusion based on timing with 1 worker instance

From the best-fit equation, I estimated that the entire calculation would run in 36.5 hours.  Before I worked on investigating why this was taking so long (i.e. by posting on the SciDB forum to bug the hard working developers)  I thought in the mean time I could:

  • upgrade to the latest version of SciDB
  • attempt to run the calculation in parallel by increasing the number of worker instances of SciDB
I upgraded to Ubuntu 12.04 and then was able to easily use the SciDB repository to install SciDB, and then some minor ssh configuration as outline in the manual.  Just follow the documentation instructions they are not long.

Setting the number of worker instances

The number of worker instances is set in the config.ini file for SciDB.  I do not know if this can be adjusted after the database has been initialized; I was starting from scratch so that was not an issue for me.  I used the basic "test1" configuration provided in the SciDB 12.10 documentation:
[ta]
server-0=localhost,3
....

I changed "test1" to "ta" for the name (in brackets), and the first line of the configuration file:  I changed the end of the line from a 0 (indicating one instance) to a 3 (indicating 4 instances).

Note:  I have effectively 4 cores on my system and the manual recommends 4 cores per 1 instance.  I broke this rule because I had perviously noticed that during matrix multiplication, only 1 of the 4 cores was in use.

Initial run with 4 worker instances

I took the initial estimate of the time to run (36 hours) and divided by 4 to estimate that the calculation would run in 9 hours now that I had 4x the processors.  After 11 hours I killed the calculation, assuming that it had hung.  To test that idea, I then measured made measurements of the time to run on the new system, to see if the calculation was really 4x faster.

Time to run calculation with 4 worker instances

System:  SciDB 12.10
Ubuntu 12.04 running on VMWare


Number of rows Time[s]
4 4.411
40 9.824
400 34.135
4000 199.285
40000 1855.251


(again note the logarithmic scale on the x-axis)

Similar functional form, roughly linear, although not strictly 4x faster - closer to 3x.  From the above function I then estimated that the calculation should run in 11.2 hours!!!
DOH!!!

I was 12 minutes away from completing an 11 hour calculation and I killed it.
(Edit / Update:  based on the piecemeal calculation I was actually more like 1.2 hours away from completion)

Ultra-crude "progress indicator" for matrix multiplication

This event made me realize that I need to have a progress indicator.  I am guessing that I was close to being done, but I don't really know.  I created a crude progress indicator by running the calculation in subarray sections, and timing each run.  When finished I can stitch the individual results together into the full result matrix.  Bash script to do this included below.

So far the subarrays of 4000 are running at just slightly over the previously measured time (199 s):
average:  232 s
standard deviation: 30s

This is almost within 1 standard deviation and thus possibly not "really" statistically different,
perhaps attributable to the fact that I am actively using the computer while the calculation is running.  Looking at the plot of the time each 4000 row subarray took to run confirms this:










     










There may be a slight trend upwards in the region ~18 to 34.

Update:  Data for complete calculation

The calculation ran in ~12.2 hours total.  Here is the a graph of how long each subarray ran in:





















Note the point at index 44 which has a much smaller run time.  From the full output, an error occurred during this run:

time iquery -naq "store(multiply(subarray(particleStem_3, 176000,0,179999,42314), eigVect_3), prinComp_3_44)"
SystemException in file: src/network/NetworkManager.cpp function: abortMessageQuery line: 860
Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CONNECTION_ERROR2
Error description: Network error. Connection error while sending.
Failed query id: 1103665169193

real    0m10.894s
user    0m0.000s
sys     0m0.036s

Sure enough, there is no prinComp_3_44 array present.  This demonstrates a pretty serious problem with using the bash script as I've written it - no exceptions, no error handling.  If I hadn't plotted the data I might not have noticed the problem, or I need to read through the entire output file to make sure there were no other failures.

Here are summary statistics of the above data, excluding point 44:

sum 44014 s 12.2 hr
average 202 s 3.4 min
standard deviation 24 s

In the bash script I printed the iquery command that was run each time along with the timing results, so it was easy to track down #44 and re-run.  It worked and the time was inline with the other measurements (197 s).

Conclusions

Increasing the number of workers by a factor of 4 increased the performance by approximately a factor of 3.  A very crude progess indicator of sorts was provided by running the calculation by subarray's of the left matrix, 4000 rows each in size.  The performance was roughly inline with expectations, taking into account that I was using the machine for other work that would have taken CPU cycles away from SciDB.

Appendix A:  Bash script for progress indicator / subarray matrix multiplication


#!/bin/bash



lastRow=873636



batchSize=4000





function runCalc {

resultArray=prinComp_3_$i



echo "time iquery -naq \"store(multiply(subarray(particleStem_3, $startRow,0,$endRow,42314), eigVect_3), $resultArray)\""

time iquery -naq "store(multiply(subarray(particleStem_3, $startRow,0,$endRow,42314), eigVect_3), $resultArray)"
}


let numBatches=$lastRow/$batchSize

for ((i = 0; i < $numBatches; i++))
do
let startRow=$i*$batchSize
let endRow=($i+1)*$batchSize-1
echo $i $startRow $endRow

runCalc

echo
done

let startRow=$i*$batchSize
endRow=$lastRow
echo last run:  $i $startRow $endRow
runCalc

9 comments:

  1. Welcome to the wonderful world of big math over big data!

    Some notes:

    1. Our initial implementation of SciDB (2 years ago) focussed more on the OLAP style array analysis, and on image processing. We treated our linear algebra support very much as a second class citizen. What you're working with now is our initial multiply, and it has a number of algorithmic and implementation issues. We're working to improve it.

    2. Having accepted the importance of linear algebra to SciDB users, we've changed our approach over the last 6 months. Rather than "rolling our own", we're working to integrate ScaLAPACK's distributed, parallel linear algebra into SciDB. The thought being that if you want to know how the "pros" handle things like big multiply, look at ScaLAPACK.

    3. What we've found, looking at a number of users, is lots of scale charts that look depressingly like the ones you have above. There's a problem with the physics here: one that will affect your estimator. Matrix multiply is inherently an O ( N ^ 3 ) operation. If you double N, you increase the run-time by a factor of 8. We've found that we scale about as well as you would expect--double N and double hardware for constant time--for many operations, but the scale curves for multiply will never look that good. What we do is bring the best possible engineering to the problem, and integrate the numerical ops with the transactional storage manager.

    ReplyDelete
    Replies
    1. Thanks for the info! I still want to investigate if I am doing something egregiously wrong. I swear I ran this calculation previously in in only ~5 hours. But I don't see anything obvious in the schema - the chunk sizes seem appropriate and similar to my previous run.

      The other nagging question I have about performance is that I estimated that I could run the same calculation on Oracle XE in 1.5 hours. But in this case it would violate the size limit. Still, I would expect SciDB to be faster.

      I should say I am very happy to have SciDB because previously running this calculation for me meant lots of text files containing the matrices to track. It is great to hear that you're going to use ScaLAPACK in SciDB!

      I should point out that I don't find the above graphs that depressing - the above plots have a log scale on the x-axis and I see that they are not very clear so I'm going to include the regular axis plots. The above are linear relations - the other way to think of matrix multiply is O (mnp) where m,n, and p are the 3 distinct matrix dimensions. In the experiments above I am just varying m and the corresponding time is pretty close to linear!

      Thanks again.

      Delete
  2. Re: "the same calculation on Oracle XE in 1.5 hours. But in this case it would violate the size limit."

    Yeah. You're looking at another engineering trade-off. The work to parallelize a linear algebra algorithm, and make it work out of memory, has the side effect of slowing it down for those situations where it can work in-memory. Also, rolling our own multiply() (as we've done) is never going to be as efficient as the LAPACK implementations of the same. Those guys have had 30 years of hand-tuning and buffing their code. To be honest, the fact that we're even within 10x of them--us = 5 hours, LAPACK inside Oracle XE at 1.5 hours--is a pleasant surprise and suggests we did OK. But it's time to concede that they're always going to be better and co-opt them as best we can.

    Re: "other way to think of matrix multiply is O (mnp) where m,n, and p are the 3 distinct matrix dimensions"

    Yeah. We're saying much the same thing. Your case is slightly simplified by the fact you're multiplying a matrix by a vector. Many of the cases we're running into use multiply operations on rectangular matrices - hence the O ( N ^ 3 ).

    Also - you got Alex P's note? I'll repeat his comment here in case anyone stumbles onto our discussion through The Great Google. The SciDB built-in multiply () operator takes a third parameter than allows you to control the choice of algorithm: multiply ( A, B, 'dense|sparse|iterative'). Given the nature of your data, Alex suspects (and I agree) that 'dense' will be best. Give it a shot and see how it goes.

    ReplyDelete
    Replies
    1. I fully agree that Oracle has had a huge head start. I should point out that I'm not using any specialized package / cartridge to do linear algebra, just simple joins and aggregations in SQL though.

      The memory issue is a great point, I had mentioned to a colleague that I suspected I was below the size where I really start to the advantage of using SciDB and that would be consistent with what you're telling me, since the current matrices could be loaded into memory.

      I am actually doing matrix multiplication with rectangular matrices, they are just not square, and I am varying the number of rows in the left matrix.
      I did get Alex's note, thanks! Initial result is by choosing 'dense" I saw a > 2x speed up. I'll do a more thorough investigation and write it up.

      Delete
  3. Oh! I saw "This returns a matrix which is 1x42315" and the subarray ( Source, 0, 0, 0, NUMBER ) in the multiply and saw a vector. Sorry.

    And just to catch up with you. You went from 1-4 instances and saw a 3x speed-up? And with the 2x speedup for the dense multiply?

    I *suspect* at the sizes you're working with that the dominant cost might be the subarray() and the store(). The subarray() operator is what we call a "shape changing" op. Its output is an array with different dimensions, and therefore a different chunking scheme, etc. Creating this new array represents a significant cost. Try a few store(join(subarray(articleStem_3, 0,0,0,42314), eigVect_3), prinComp_3) to calibrate your overhead. Or: store ( subarray(articleStem_3, 0,0,0,42314), Temp ) and then multiply ( Temp, eigVect_3 ). We're really geared for the multiply ( particleStem_3, eigVect_3 ) kinds of queries.

    Also - bug! How can we work with you to get a handle on that (apparent) crash at the 44th subarray query?

    Bit confused by something, BTW. I thought you were using the Oracle LA extensions because you wrote somewhere that your query would exceed some limit. If you're just writing SQL, why is there a size limit?


    ReplyDelete
    Replies
    1. Sorry, that is a bit confusing - the shortest case is when it is just a vector.

      Yep, going from 1 to 4 instances resulted in about 3x increase in speed. Then adding the specifier 'dense' was a further speedup of 2x - at around 10,000 rows. The full run (~870,000 rows) using 'dense' just finished in 8 hours (1.5x speed increase)

      I almost mentioned in my previous comment that I had noticed that it appeared that the bulk of the time was taken up in the store operation rather than the multiply. I haven't done a thorough check though.

      What information do you need about the 44th query? I can get you the actual data, configuration file, just let me know.

      The size limit is because I am using Oracle XE which is the free version. As part of it being free it is restricted to only 11 GB of data.

      Delete
  4. Re: Oracle XE - I see. Suggestion? Try the same thing on PostgreSQL or MySQL. The PostgreSQL/MySQL v. Oracle performance thing is within 10x on this kind of query. If SciDB can't beat the other open source engines by 10x on the size of data we're dealing with, then we have issues.

    Re: Timing - so 4 instances, using multiply( A, B, 'dense') completes in ~8 hours? And the bulk of the time was in "store()? Let's clarify a couple of things. . .

    1. Try running "count ( multiply ( ... ) ) ", rather than "store ( multiply ( ... ) )". That way, we'll be able to see what's in the store(), and what's in the multiply().
    2. How are you determining that "store()" is the long pole in the tent? If you're using OS tools, it's possible that what's going on is that the default cache is such that you're actually swapping. What you're looking at isn't "store()", so much as the fact our default memory usage configuration might be too small. By default, SciDB only asks for about 1G per server. When you ask "top", what's the %Mem usage on the SciDB processes? How much memory are they using? If you have < 75% memory usage overall, it might be a good idea to increase the SciDB cache. (smgr-cache-size=4192 - set it to 4G per instance).


    Re: Debug ...

    1. A bit of background. Each SchDB engine is actually two processes; the working process, and a "watchdog" intended to restart / recover the worker process goes down. (We will also handle things like quiescence, rolling upgrades, etc using the watchdog - just not yet).

    2. When the worker process went down, it should have done the following:
    i. Write a lucid and thoughtfully composed suicide note and placed it into scidb-stderr.log. Depending on how depressed it was, it might also have written something longer in scidb.log.
    ii. It should have dumped a core file in the 000/0 directory

    Would you be so kind as to look there first, please? I'll e-mail you privately with contact details.

    ReplyDelete
    Replies
    1. OracleXE: my initial decision was choosing whether to do this work in SciDB or MySQL (http://dllahr.blogspot.com/2012/10/migrating-from-oraclexe-choosing-new.html) and I chose SciDB for a number of reasons. But I think it would be good to setup and do a quick load of the data into an opensource relational database and see how it goes for the full calculation.

      Timing: yes, 4 instances ran the calculation in 8 hours. I initially thought the bulk of the calculation was in store() (based on tests at 10k rows), but then I tried the suggested calculation of count vs. store and the numbers came back very similar: for 10k rows of the same matrix, count took 5 min, store took 5.3 min. Memory usage: during the peak of the calculation it was around 20% per instance or 1.2 GB (~80% total for all SciDB instances).

      Debug: I found the query in the regular log file, but did not find the suicide note in any of the scidb-stderr.log files. I didn't find the core file (assuming it has "core" in the name). I did find an interesting section of the 1/scidb.log file that I read as indicating that instance 1 crashed during the calculation and was restarted (07:54:19,303):
      2012-11-13 07:54:14,022 [0x7fc5f3504700] [DEBUG]: Granted shared lock of array prinComp_3_44@1 for query 1103665169193
      2012-11-13 07:54:14,022 [0x7fc5f3504700] [DEBUG]: Array prinComp_3_44@1 was opened
      2012-11-13 07:54:14,022 [0x7fc5f3504700] [DEBUG]: Sending barrier to every one and waiting for 3 barrier messages
      2012-11-13 07:54:14,023 [0x7fc5f3504700] [DEBUG]: All barrier messages received - continuing
      2012-11-13 07:54:19,303 [0x7fc5f35247c0] [INFO ]: Start SciDB instance (pid=16295). SciDB version: 12.10.0.5773 (Cheshire-final). Build
      type: RelWithDebInfo. Copyright (C) 2008-2012 SciDB, Inc.


      Delete
    2. (to clarify: memory usage was observed for the full calculations, done either piecemeal or complete)

      Delete