Distractions: Timing Matrix Multiplication in SciDB and Setting the Number of Worker Instances in SciDB and Running Matrix Multiplication Piecemeal

Tuesday, November 13, 2012

Timing Matrix Multiplication in SciDB and Setting the Number of Worker Instances in SciDB and Running Matrix Multiplication Piecemeal

Summary: I am multiplying 2 matrices in SciDB. Previously I recorded the calculation ran in 5 hours, but now I am observing / estimating it to run in 33 hours. This post is a description of my investigation and my attempt to speed up the calculation by reconfiguring SciDB to use more processors, and ultimately running the calculation piecemeal so I could monitor its progress.

Update: Using a system with 4 worker instances instead of 1 decreased the time by approximately a factor of 3.

Further Update: adding the specifier 'dense' to the multiply command increased the speed further by a factor of 1.5

Background

I have 2 matrices in SciDB that I want to multiply:
particleStem_3 is 873637 x 42315
eigVect_3 is 42315 x 100

schema:
[("particleStem_3<count:double> [stem=0:873636,20,0,particle=0:42314,42315,0]")]
[("eigVect_3<value:double> [particle=0:42314,42315,0,eig=0:99,20,0]")]

Note:

the chunk size of the inner dimension (particle) matches
the number of elements in the chunks of each array are ~1000000 as recommended

particleStem_3: 20 * 42315 = 846300

(okay, that is some serious handwaving)

eigVect_3: 42315 * 20 = 846300

My investigation focused on measuring the time of the calculation as it changed with increasing the number of rows in the left most matrix. I take a subarray of the left matrix using:

subarray(particleStem_3, 0,0,3,42314)

This returns a matrix which is 4x42315:

left most 0 to 3: 4 units of the first dimension
2nd 0 to 42314: 42315 units of the second dimension

I nested this subarray call within the matrix multiply statement, and stored the result. I timed the whole operation:


time iquery -naq "store(multiply(subarray(articleStem_3, 0,0,0,42314), eigVect_3), prinComp_3)"

Query was executed successfully



real 0m12.158s

user 0m0.028s

sys 0m0.012s

Time to run calculation using 1 instance

System: SciDB 12.03
Ubuntu 11.10 running under VMWare

I repeated this for increasing numbers of rows (removing prinComp_3 between runs, remove(prinComp_3)), to obtain this data:

number of rows	time[s]
4	13.654
40	25.9
400	85.702
4000	620.328

Note the log scale on the x-axis.

Conclusion based on timing with 1 worker instance

From the best-fit equation, I estimated that the entire calculation would run in 36.5 hours. Before I worked on investigating why this was taking so long (i.e. by posting on the SciDB forum to bug the hard working developers) I thought in the mean time I could:

upgrade to the latest version of SciDB
attempt to run the calculation in parallel by increasing the number of worker instances of SciDB

I upgraded to Ubuntu 12.04 and then was able to easily use the SciDB repository to install SciDB, and then some minor ssh configuration as outline in the manual. Just follow the documentation instructions they are not long.

Setting the number of worker instances

The number of worker instances is set in the config.ini file for SciDB. I do not know if this can be adjusted after the database has been initialized; I was starting from scratch so that was not an issue for me. I used the basic "test1" configuration provided in the SciDB 12.10 documentation:

[ta]

server-0=localhost,3

....

I changed "test1" to "ta" for the name (in brackets), and the first line of the configuration file: I changed the end of the line from a 0 (indicating one instance) to a 3 (indicating 4 instances).

Note: I have effectively 4 cores on my system and the manual recommends 4 cores per 1 instance. I broke this rule because I had perviously noticed that during matrix multiplication, only 1 of the 4 cores was in use.

Initial run with 4 worker instances

I took the initial estimate of the time to run (36 hours) and divided by 4 to estimate that the calculation would run in 9 hours now that I had 4x the processors. After 11 hours I killed the calculation, assuming that it had hung. To test that idea, I then measured made measurements of the time to run on the new system, to see if the calculation was really 4x faster.

Time to run calculation with 4 worker instances

System: SciDB 12.10

Ubuntu 12.04 running on VMWare

Number of rows	Time[s]
4	4.411
40	9.824
400	34.135
4000	199.285
40000	1855.251

(again note the logarithmic scale on the x-axis)

Similar functional form, roughly linear, although not strictly 4x faster - closer to 3x. From the above function I then estimated that the calculation should run in 11.2 hours!!!
DOH!!!

I was 12 minutes away from completing an 11 hour calculation and I killed it.
(Edit / Update: based on the piecemeal calculation I was actually more like 1.2 hours away from completion)

Ultra-crude "progress indicator" for matrix multiplication

This event made me realize that I need to have a progress indicator. I am guessing that I was close to being done, but I don't really know. I created a crude progress indicator by running the calculation in subarray sections, and timing each run. When finished I can stitch the individual results together into the full result matrix. Bash script to do this included below.

So far the subarrays of 4000 are running at just slightly over the previously measured time (199 s):
average: 232 s
standard deviation: 30s

This is almost within 1 standard deviation and thus possibly not "really" statistically different,
perhaps attributable to the fact that I am actively using the computer while the calculation is running. Looking at the plot of the time each 4000 row subarray took to run confirms this:

There may be a slight trend upwards in the region ~18 to 34.

Update: Data for complete calculation

The calculation ran in ~12.2 hours total. Here is the a graph of how long each subarray ran in:

Note the point at index 44 which has a much smaller run time. From the full output, an error occurred during this run:


time iquery -naq "store(multiply(subarray(particleStem_3, 176000,0,179999,42314), eigVect_3), prinComp_3_44)"

SystemException in file: src/network/NetworkManager.cpp function: abortMessageQuery line: 860

Error id: scidb::SCIDB_SE_NETWORK::SCIDB_LE_CONNECTION_ERROR2

Error description: Network error. Connection error while sending.

Failed query id: 1103665169193

real 0m10.894s

user 0m0.000s

sys 0m0.036s

Sure enough, there is no prinComp_3_44 array present. This demonstrates a pretty serious problem with using the bash script as I've written it - no exceptions, no error handling. If I hadn't plotted the data I might not have noticed the problem, or I need to read through the entire output file to make sure there were no other failures.

Here are summary statistics of the above data, excluding point 44:

sum 44014 s 12.2 hr

average 202 s 3.4 min

standard deviation 24 s

In the bash script I printed the iquery command that was run each time along with the timing results, so it was easy to track down #44 and re-run. It worked and the time was inline with the other measurements (197 s).

Conclusions

Increasing the number of workers by a factor of 4 increased the performance by approximately a factor of 3. A very crude progess indicator of sorts was provided by running the calculation by subarray's of the left matrix, 4000 rows each in size. The performance was roughly inline with expectations, taking into account that I was using the machine for other work that would have taken CPU cycles away from SciDB.

Appendix A: Bash script for progress indicator / subarray matrix multiplication

#!/bin/bash

lastRow=873636

batchSize=4000

function runCalc {

resultArray=prinComp_3_$i

 echo "time iquery -naq \"store(multiply(subarray(particleStem_3, $startRow,0,$endRow,42314), eigVect_3), $resultArray)\""






 time iquery -naq "store(multiply(subarray(particleStem_3, $startRow,0,$endRow,42314), eigVect_3), $resultArray)" 

}







let numBatches=$lastRow/$batchSize




for ((i = 0; i < $numBatches; i++))

do

 let startRow=$i*$batchSize

 let endRow=($i+1)*$batchSize-1

 echo $i $startRow $endRow




 runCalc




 echo

done




let startRow=$i*$batchSize

endRow=$lastRow

echo last run:  $i $startRow $endRow

runCalc

9 comments:

UnknownNovember 13, 2012 at 1:26 PM
Welcome to the wonderful world of big math over big data!

Some notes:

1. Our initial implementation of SciDB (2 years ago) focussed more on the OLAP style array analysis, and on image processing. We treated our linear algebra support very much as a second class citizen. What you're working with now is our initial multiply, and it has a number of algorithmic and implementation issues. We're working to improve it.

2. Having accepted the importance of linear algebra to SciDB users, we've changed our approach over the last 6 months. Rather than "rolling our own", we're working to integrate ScaLAPACK's distributed, parallel linear algebra into SciDB. The thought being that if you want to know how the "pros" handle things like big multiply, look at ScaLAPACK.

3. What we've found, looking at a number of users, is lots of scale charts that look depressingly like the ones you have above. There's a problem with the physics here: one that will affect your estimator. Matrix multiply is inherently an O ( N ^ 3 ) operation. If you double N, you increase the run-time by a factor of 8. We've found that we scale about as well as you would expect--double N and double hardware for constant time--for many operations, but the scale curves for multiply will never look that good. What we do is bring the best possible engineering to the problem, and integrate the numerical ops with the transactional storage manager.

ReplyDelete
Replies
Paul BrownNovember 14, 2012 at 3:05 PM
Re: "the same calculation on Oracle XE in 1.5 hours. But in this case it would violate the size limit."

Yeah. You're looking at another engineering trade-off. The work to parallelize a linear algebra algorithm, and make it work out of memory, has the side effect of slowing it down for those situations where it can work in-memory. Also, rolling our own multiply() (as we've done) is never going to be as efficient as the LAPACK implementations of the same. Those guys have had 30 years of hand-tuning and buffing their code. To be honest, the fact that we're even within 10x of them--us = 5 hours, LAPACK inside Oracle XE at 1.5 hours--is a pleasant surprise and suggests we did OK. But it's time to concede that they're always going to be better and co-opt them as best we can.

Re: "other way to think of matrix multiply is O (mnp) where m,n, and p are the 3 distinct matrix dimensions"

Yeah. We're saying much the same thing. Your case is slightly simplified by the fact you're multiplying a matrix by a vector. Many of the cases we're running into use multiply operations on rectangular matrices - hence the O ( N ^ 3 ).

Also - you got Alex P's note? I'll repeat his comment here in case anyone stumbles onto our discussion through The Great Google. The SciDB built-in multiply () operator takes a third parameter than allows you to control the choice of algorithm: multiply ( A, B, 'dense|sparse|iterative'). Given the nature of your data, Alex suspects (and I agree) that 'dense' will be best. Give it a shot and see how it goes.
ReplyDelete
Replies
Paul BrownNovember 14, 2012 at 9:26 PM
Oh! I saw "This returns a matrix which is 1x42315" and the subarray ( Source, 0, 0, 0, NUMBER ) in the multiply and saw a vector. Sorry.

And just to catch up with you. You went from 1-4 instances and saw a 3x speed-up? And with the 2x speedup for the dense multiply?

I *suspect* at the sizes you're working with that the dominant cost might be the subarray() and the store(). The subarray() operator is what we call a "shape changing" op. Its output is an array with different dimensions, and therefore a different chunking scheme, etc. Creating this new array represents a significant cost. Try a few store(join(subarray(articleStem_3, 0,0,0,42314), eigVect_3), prinComp_3) to calibrate your overhead. Or: store ( subarray(articleStem_3, 0,0,0,42314), Temp ) and then multiply ( Temp, eigVect_3 ). We're really geared for the multiply ( particleStem_3, eigVect_3 ) kinds of queries.

Also - bug! How can we work with you to get a handle on that (apparent) crash at the 44th subarray query?

Bit confused by something, BTW. I thought you were using the Oracle LA extensions because you wrote somewhere that your query would exceed some limit. If you're just writing SQL, why is there a size limit?

ReplyDelete
Replies
Paul BrownNovember 15, 2012 at 11:14 AM
Re: Oracle XE - I see. Suggestion? Try the same thing on PostgreSQL or MySQL. The PostgreSQL/MySQL v. Oracle performance thing is within 10x on this kind of query. If SciDB can't beat the other open source engines by 10x on the size of data we're dealing with, then we have issues.

Re: Timing - so 4 instances, using multiply( A, B, 'dense') completes in ~8 hours? And the bulk of the time was in "store()? Let's clarify a couple of things. . .

1. Try running "count ( multiply ( ... ) ) ", rather than "store ( multiply ( ... ) )". That way, we'll be able to see what's in the store(), and what's in the multiply().
2. How are you determining that "store()" is the long pole in the tent? If you're using OS tools, it's possible that what's going on is that the default cache is such that you're actually swapping. What you're looking at isn't "store()", so much as the fact our default memory usage configuration might be too small. By default, SciDB only asks for about 1G per server. When you ask "top", what's the %Mem usage on the SciDB processes? How much memory are they using? If you have < 75% memory usage overall, it might be a good idea to increase the SciDB cache. (smgr-cache-size=4192 - set it to 4G per instance).

Re: Debug ...

1. A bit of background. Each SchDB engine is actually two processes; the working process, and a "watchdog" intended to restart / recover the worker process goes down. (We will also handle things like quiescence, rolling upgrades, etc using the watchdog - just not yet).

2. When the worker process went down, it should have done the following:
i. Write a lucid and thoughtfully composed suicide note and placed it into scidb-stderr.log. Depending on how depressed it was, it might also have written something longer in scidb.log.
ii. It should have dumped a core file in the 000/0 directory

Would you be so kind as to look there first, please? I'll e-mail you privately with contact details.
ReplyDelete
Replies

Add comment

Distractions

Tuesday, November 13, 2012

Timing Matrix Multiplication in SciDB and Setting the Number of Worker Instances in SciDB and Running Matrix Multiplication Piecemeal

Background

Time to run calculation using 1 instance

Conclusion based on timing with 1 worker instance

Setting the number of worker instances

Initial run with 4 worker instances

Time to run calculation with 4 worker instances

Ultra-crude "progress indicator" for matrix multiplication

Update: Data for complete calculation

sum 44014 s 12.2 hr

average 202 s 3.4 min

standard deviation 24 s

Conclusions

Appendix A: Bash script for progress indicator / subarray matrix multiplication

9 comments:

Contributors

Blog Archive

Subscribe To Distractions

Tuesday, November 13, 2012

Timing Matrix Multiplication in SciDB and Setting the Number of Worker Instances in SciDB and Running Matrix Multiplication Piecemeal

Background

Time to run calculation using 1 instance

Conclusion based on timing with 1 worker instance

Setting the number of worker instances

Initial run with 4 worker instances

Time to run calculation with 4 worker instances

Ultra-crude "progress indicator" for matrix multiplication

Update: Data for complete calculation

sum 44014 s 12.2 hr average 202 s 3.4 min standard deviation 24 s

Conclusions

Appendix A: Bash script for progress indicator / subarray matrix multiplication

9 comments:

Contributors

Blog Archive

sum 44014 s 12.2 hr

average 202 s 3.4 min

standard deviation 24 s