Distractions: 2012

Saturday, December 22, 2012

Python code to simplify loading data into SciDB

Summary: I've written some Python code that simplifies the loading of data from a csv file into SciDB. The programmer specifies for each column in the csv file whether it should be an attribute or a dimension in the SciDB array. The code then loads the file as a raw array, creates the destination array based on the provided specifications and the data loaded into the raw array, and transfers the data from the raw array to the destination array.

I've added the code to this GitHub repository under the directory ScidbLoader:

https://github.com/dllahr/scidb_python_utils
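As a rough illustration of the "attribute or dimension" specification (this is not the code from the repository), here is a minimal sketch that turns per-column specifications into a schema string in the same layout as the SciDB schemas shown elsewhere on this blog. The function name build_schema, the column tuples, and the single shared chunk size are all assumptions made for the example.

```python
def build_schema(array_name, columns, dim_ranges, chunk_size):
    """Build a SciDB-style schema string from per-column specifications.

    columns: list of (name, kind, type), kind is "attribute" or "dimension"
    dim_ranges: dict of dimension name -> (low, high), e.g. found from the raw load
    chunk_size: one chunk length used for every dimension (a simplification)
    """
    attrs = [f"{name}:{typ}" for name, kind, typ in columns if kind == "attribute"]
    dims = [
        f"{name}={dim_ranges[name][0]}:{dim_ranges[name][1]},{chunk_size},0"
        for name, kind, typ in columns
        if kind == "dimension"
    ]
    if not attrs or not dims:
        raise ValueError("need at least one attribute and one dimension")
    return f"{array_name}<{','.join(attrs)}> [{','.join(dims)}]"


# Hypothetical csv with three columns: two dimensions and one attribute.
columns = [
    ("stem", "dimension", "int64"),
    ("particle", "dimension", "int64"),
    ("count", "attribute", "double"),
]
schema = build_schema("example", columns, {"stem": (0, 99), "particle": (0, 49)}, 20)
print(schema)
```

The dimension ranges are passed in separately because, in the workflow described above, they only become known after the data has been loaded into the raw array.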

TODO

when calculating dimension chunk sizes, need to scale by the number of attributes - currently assumes 1 attribute

MIT License

All code presented on this blog is Copyright (C) David L. Lahr in the year it was published and released under the MIT License:

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in the
Software without restriction, including without limitation the rights to use, copy,
modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so, subject to the
following conditions:

The above copyright notice and this permission notice shall be included in all copies
or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE
FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.

Sunday, December 16, 2012

Python code to simplify reading SciDB data

I've written some code in Python that uses the SciDB Python connector to access data in a more straightforward manner. In summary, you submit a query and get back an iterator over the data. It is currently very incomplete, and needs:

~~Currently it just returns the attributes. It needs to also return the values of the dimensions.~~ Done!
Ability to reset the iterator - it can currently only be used once

The code is publicly available from this github repository:
https://github.com/dllahr/scidb_python_utils

Size of a polymer from a random walk

Summary: From statistical mechanics, the size of a polymer is generally estimated using the statistics of a random walk. Here I investigate the assumption that the size of the polymer is proportional to the distance between the start and end points of a random walk as it is generally taught in statistical mechanics.

Review of random walk in 1 dimension

Start at the origin of the x-axis (x = 0). At each step, there is a 50% chance of moving 1 unit to the right, 50% chance of moving 1 unit to the left.

Here are some examples of random walks:

For N steps, the probability of having ended up at position x is given by the binomial distribution:

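(from the above page at Wolfram). The width of the above distribution scales as:
sqrt(# of steps)
(the standard deviation is exactly sqrt(# of steps); in the Gaussian limit the full width at half max is about 2.35 times that)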
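The sqrt(# of steps) scaling can be checked numerically. The sketch below simulates many 1-dimensional walks and estimates the spread (standard deviation) of the end positions; the function names, the number of walks, and the seed are choices made for this example.

```python
import math
import random


def random_walk_end(n_steps, rng):
    """Return the final position after n_steps of +1 / -1 with equal probability."""
    return sum(rng.choice((-1, 1)) for _ in range(n_steps))


def spread_of_walks(n_steps, n_walks, seed=0):
    """Estimate the standard deviation of the end position over many walks."""
    rng = random.Random(seed)
    ends = [random_walk_end(n_steps, rng) for _ in range(n_walks)]
    mean = sum(ends) / n_walks
    variance = sum((x - mean) ** 2 for x in ends) / n_walks
    return math.sqrt(variance)


# Compare the simulated spread with sqrt(N) for a few walk lengths.
for n in (25, 100, 400):
    print(n, round(spread_of_walks(n, 2000), 2), math.sqrt(n))
```

Quadrupling the number of steps should roughly double the spread, which is the property the polymer size estimate relies on.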

Timing Matrix Multiplication in SciDB and Setting the Number of Worker Instances in SciDB and Running Matrix Multiplication Piecemeal

Summary: I am multiplying 2 matrices in SciDB. Previously I recorded that the calculation ran in 5 hours, but now I am observing / estimating that it will take 33 hours. This post is a description of my investigation and my attempt to speed up the calculation by reconfiguring SciDB to use more processors, and ultimately running the calculation piecemeal so I could monitor its progress.

Update: Using a system with 4 worker instances instead of 1 decreased the time by approximately a factor of 3.

Further Update: adding the specifier 'dense' to the multiply command increased the speed further by a factor of 1.5

Background

I have 2 matrices in SciDB that I want to multiply:
particleStem_3 is 873637 x 42315
eigVect_3 is 42315 x 100

schema:
[("particleStem_3<count:double> [stem=0:873636,20,0,particle=0:42314,42315,0]")]
[("eigVect_3<value:double> [particle=0:42314,42315,0,eig=0:99,20,0]")]
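To get a feel for what these schemas imply, here is a small sketch that counts cells and chunks, reading each dimension as name=low:high,chunk_length,overlap. The helper name chunk_stats is made up for this example, and the "if dense" size simply assumes 8 bytes for every cell, which overstates the storage if most cells are empty.

```python
import math


def chunk_stats(dims):
    """dims: list of (low, high, chunk_length) taken from a schema string.

    Returns (total_cells, cells_per_chunk, number_of_chunks).
    """
    total_cells = 1
    cells_per_chunk = 1
    n_chunks = 1
    for low, high, chunk_length in dims:
        length = high - low + 1
        total_cells *= length
        cells_per_chunk *= chunk_length
        n_chunks *= math.ceil(length / chunk_length)
    return total_cells, cells_per_chunk, n_chunks


particle_stem = chunk_stats([(0, 873636, 20), (0, 42314, 42315)])
eig_vect = chunk_stats([(0, 42314, 42315), (0, 99, 20)])

for name, (cells, per_chunk, n_chunks) in (("particleStem_3", particle_stem),
                                            ("eigVect_3", eig_vect)):
    dense_gb = cells * 8 / 1e9  # 8 bytes per double, if every cell were filled
    print(f"{name}: {cells} cells, {per_chunk} cells/chunk, "
          f"{n_chunks} chunks, {dense_gb:.1f} GB if dense")
```

Both arrays end up with the same nominal number of cells per chunk, because each pairs a chunk length of 20 with the full particle dimension.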

Summary of loading data and fitting regression in R

I mainly learned about R from this post about doing linear regression in R:
http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/R/

Of course the R-manual is great:
http://cran.r-project.org/doc/manuals/r-release/R-intro.html

However, R is a huge system, and when you are getting started the entire manual can be daunting. That is why I prefer to begin with a tutorial and then look up more functionality in the manual.

Here are some tips and tricks that I commonly use:

How I investigated and shrank Oracle XE

Or how I learned to stop worrying and log in as sysdba to reset the system password

Oracle XE has a limit of 11 GB on data, which I was close to using. However, the entire Oracle XE system was taking up 52 GB on disk. Here is how I investigated why it was so large and how I shrank it.

I started with Google and didn't find anything obvious. I then went to the web-based dashboard / control for Oracle XE:
localhost:8080/apex

and attempted to log in as system. I was informed that the password was expired. Google revealed that I could log in as sysdba and reset the password by:

running sqlplus.exe as Administrator
for the username, entering

/as sysdba

and then running

alter user system identified by newpass;

I then logged into the dashboard / control and checked under the "storage" tab. This listed the tablespaces in the system.

Exporting from OracleXE

Here are 3 things I did to export from OracleXE with various levels of success:

Use SQLDeveloper to run a query, then save the results

pretty fast but failed for results above a certain size

Use SQL*Plus to run the query, spool the results to a file

works pretty well but has to display all of the results on the screen, which slows it down
~5% CPU usage
58 minutes for 15.7 million rows

Use a custom written quick Java application to run the query and print the results - redirect the results to a file

fast and efficient
~10% CPU usage
15 minutes
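
The third approach boils down to fetching rows in batches and writing them straight to a file without displaying them. The application described above was written in Java; the sketch below shows the same idea in Python. It uses an in-memory sqlite3 database as a stand-in so that it runs anywhere, and the table, column names, and batch size are made up for the example (it is not the original application and does not talk to Oracle XE).

```python
import csv
import io
import sqlite3


def export_query(connection, query, out_file, batch_size=10000):
    """Stream the rows of a query to out_file as CSV, batch_size rows at a time."""
    cursor = connection.cursor()
    cursor.execute(query)
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])  # header row
    n_rows = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)
        n_rows += len(rows)
    return n_rows


# Stand-in database with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("create table demo (stem integer, particle integer, cnt real)")
conn.executemany("insert into demo values (?, ?, ?)",
                 [(i, i % 7, i * 0.5) for i in range(25)])

buffer = io.StringIO()  # a real export would pass an open file instead
exported = export_query(conn, "select stem, particle, cnt from demo order by stem",
                        buffer, batch_size=10)
print(exported, "rows written")
```

Fetching in batches keeps memory use flat regardless of the result size, and nothing is echoed to the screen, which was the bottleneck with the spool approach.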

Matrix Multiplication in SciDB - Chunk size requirement and approximate speed comparison to OracleXE

Having loaded some data into SciDB, I now get to the heart of the matter: matrix multiplication. Here is what I have for matrix dimensions:

Matrix A: 870,000 x 42000
Matrix B: 42000 x 100

I want to do
C = A x B

C will be
870,000 x 100
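
For a sense of scale, here is a short sketch of the shape rule and the amount of arithmetic involved if the product is computed densely. The function names are made up for the example, and the operation count ignores any savings from sparsity.

```python
def multiply_shape(shape_a, shape_b):
    """Return the shape of A x B, or raise if the inner dimensions differ."""
    rows_a, cols_a = shape_a
    rows_b, cols_b = shape_b
    if cols_a != rows_b:
        raise ValueError(f"inner dimensions differ: {cols_a} vs {rows_b}")
    return rows_a, cols_b


def multiply_adds(shape_a, shape_b):
    """Number of multiply-add operations for a dense product."""
    rows_a, cols_a = shape_a
    _, cols_b = multiply_shape(shape_a, shape_b)
    return rows_a * cols_a * cols_b


shape_c = multiply_shape((870_000, 42_000), (42_000, 100))
ops = multiply_adds((870_000, 42_000), (42_000, 100))
print(shape_c, f"{ops:.3e} multiply-adds")
```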

SciDB makes this easy:
AFL% multiply(A, B)

will produce the result, but we really want to store it, and not display it, so rather than run from within iquery, issue a command using iquery that suppresses output:

scidb@ubuntu:~$ iquery -naq "store ( multiply (A, B), C)"

(much easier than SQL equivalent involving insert ... select with joins and aggregations)

***BUT THIS DIDN'T WORK***

Migrating from OracleXE to SciDB - the recommended way

This post follows from a previous post describing my initial data migration efforts:
http://dllahr.blogspot.com/2012/10/migrating-data-from-oraclexe-to-scidb.html

Thanks to Paul on SciDB forum for showing me another way to load the data without having to write my own script to generate the SciDB formatted data files. This method is probably much safer in that if the SciDB format changes I don't need to worry about updating my script.
http://www.scidb.org/forum/viewtopic.php?f=11&t=598

The recommended way

Migrating data from OracleXE to SciDB (outdated)

Edit: Read about the easier, recommended method here: http://dllahr.blogspot.com/2012/10/migrating-from-oraclexe-to-scidb.html

Migrating from OracleXE - choosing a new database (SciDB vs. MySQL)

I hit the 11 GB limit on OracleXE for my home project, so I needed to find a new database. I quickly settled on either using MySQL or SciDB.

MySQL is another relational database, but opensource / there is a community edition available for free without a size limit*

SciDB is a new type of database designed for "big data" and doing advanced mathematics on that data.

Here are the basic pros and cons I came up with for each based on my situation

MySQL:

pros:

well established
easy to install
easy (probably) to switch my hibernate code to use instead of XE
would learn about a major Oracle alternative (haven't used it in ~5 years)

cons:

probably at best as fast as OracleXE for linear algebra operations, possibly slower
not sure how / if possible to run on multiple processors

SciDB

pros:

designed / optimized for linear algebra
learn something completely new
designed to be scalable and run calculations in parallel

cons:

experimental / not well established
not sure if it works with hibernate

Next step: timebox SciDB investigation

I decided to set aside 4 hours of work to see if I could get SciDB up and running. If it took longer than that, then I made the guess that I would continue to have chronic problems using it. In the meantime, as a backup I installed MySQL on my system.

System: Dell laptop running Win7. 8 GB RAM. Pentium with 8 "threads" (~= cores). I have VMWare player with an installation of Ubuntu 11.10.

SciDB runs on linux so all my work with SciDB was done on the Ubuntu running on the VMWare player.

I attempted to install SciDB v12.3 from binaries and could not resolve all the dependencies. They are specified for 11.04, so that may have been the issue.

Install from source code worked fine - it was relatively easy and fast (beat the timebox by a lot). I followed the instructions in the manual. There was a minor problem with the config file in the documentation, as was explained in this forum post:

http://www.scidb.org/forum/viewtopic.php?f=11&t=506&p=828&hilit=803#p828

You need to create a user called "scidb" with sudo privileges, and then when you execute commands log in as the scidb user.

I also wrote it up on the scidb forum:

http://www.scidb.org/forum/viewtopic.php?f=11&t=589

Decision: SciDB

At this point I decided to go with SciDB. It passed the install timebox test very well and I decided I would rather learn about the newer SciDB than MySQL.

Edit: Probably does not work with hibernate

It appears that SciDB currently only provides connectors for Python and C/C++. My plan therefore is to do a partial migration - the large, matrix-based data will be migrated to SciDB, while the relational information will stay in OracleXE. It probably would make sense to migrate the relational information to the Postgres instance that is associated with the SciDB system, but I'm going to just focus on the first part for now.

Saturday, July 7, 2012

Messing with SWF (Flash files)

I wanted to extract the music from a flash file I was viewing, so I looked at the source code for the web page and found the link to the SWF file. I downloaded this directly using curl
curl -O http://address.goes.here

I tried to play it in mplayer, no dice, similar with ffmpeg. Some googling later revealed that I could either attempt to convert the full thing to video, or I could just extract what I needed. I went with the latter, using SwfTools

(thanks to Doesn't Not Compute for the pointers)

Google found this older FAQ for SwfTools which I used - I followed the instructions under (4) and installed freetype and jpeglib first (in case I want to do more advanced stuff with SwfTools later). After I configured and compiled each of these, I configured and compiled SwfTools. I was then ready for fun.

I followed (14) from the FAQ to extract the soundtrack.

I listed everything in my SWF:

swfextract downloaded.swf

From the list I saw the last entry was "[-m] 1 MP3 soundtrack". I was able to extract it with:

swfextract -m downloaded.swf
(result sent to output.mp3)

Success!

Wednesday, April 25, 2012

Speed of Light in Vacuum and in Solids; Cherenkov radiation

I'll start near the end (speed of light), and then work my way backwards, then jump to the finish (Cherenkov radiation). The speed of light in a vacuum is defined based on Maxwell's equations, which can be re-arranged to give a wave equation, yielding a velocity of the wave that is:
c = 1 / sqrt(εo * μo)

εo is the permittivity of free space. This constant is used to calculate the electric field at some distance (r) from a charge. It basically says if you have this much charge (Q), you get this much electric field (E).

μo is the permeability of free space. Similar to εo, this describes how if you have this much current (I) you get this much magnetic field (B).

The speed of light in a linear dielectric material is determined by the permittivity of the material (ε) , and the permeability of the material (μ):
v = 1 / sqrt(ε * μ)

These have the same meanings as above, but apply within the material. For example, ε tells you if you have this much charge (Q) within your material, you can find this much electric field (E).
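
The two formulas are easy to evaluate. The sketch below uses standard values for εo and μo; the material is an assumption made for the example (relative permittivity 1.77 and relative permeability 1, roughly water at optical frequencies).

```python
import math

EPSILON_0 = 8.8541878128e-12  # permittivity of free space, F/m
MU_0 = 1.25663706212e-6       # permeability of free space, H/m


def wave_speed(epsilon, mu):
    """Speed of an electromagnetic wave given permittivity and permeability."""
    return 1.0 / math.sqrt(epsilon * mu)


c = wave_speed(EPSILON_0, MU_0)

# Assumed example material: relative permittivity 1.77, relative permeability 1.
v = wave_speed(1.77 * EPSILON_0, 1.0 * MU_0)

print(f"c = {c:.4e} m/s")
print(f"v = {v:.4e} m/s, c / v = {c / v:.3f}")
```

The ratio c / v is the refractive index of the material, which for this choice of ε and μ is the square root of the relative permittivity.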

But the speed of light is not about charges or currents. It is about electric and magnetic fields oscillating, and that oscillation propagating far away (and long after) the original charges' and currents' motions stopped. So how do these constants come to define the speed of light? The short answer is Maxwell's equations. Basically, the infinitesimal story can be written as:

1. electric charge undergoes a small acceleration
2. perpendicular to the direction of motion, an electric field increases in magnitude
3. Maxwell's third equation states that the change of the magnetic field in time is the negative curl of the electric field. Leaving the mathematical details aside, the end result is that the increasing electric field from (2) causes an increasing magnetic field
4. Maxwell's fourth equation states an analogous relationship to (3): the change of the electric field in time is the curl of the magnetic field. Again, the increasing magnetic field causes an increasing electric field.
5. (3) and (4) then set up the propagation through empty space / vacuum.

In Maxwell's equations, the constant of proportionality relating the change of the electric field in time (dE / dt) to the curl of the magnetic field (curl B) is εo * μo: in empty space, curl B = εo * μo * dE / dt. So the story is that the time change / response of one type of field (electric or magnetic) to the other type determines the speed of propagation. Now, this applies the same in materials - except that instead of having εo * μo describe that time response, we have ε * μ.

What about these material constants? Well, the short answer is that the microscopic charges / structures of the material determine how much electric field (E) you get for a given charge (Q). Generally these numbers (ε, μ) are greater than their vacuum counterparts. A way to think about this is to imagine a "test" charge within a material. This test charge will cause the microscopic charges within the material to be attracted / repelled. This rearrangement of charge partially screens the test charge, making it appear smaller than it is, so the electric field (E) is smaller than it would be in vacuum for the same charge (E is proportional to Q / ε).

We can apply the same story to understand the slower speed of light within the material. The adjusting electric field (of the wave) in the material now has to push on the microscopic charges and they have to re-arrange before the field can affect the magnetic field, and vice-versa, thus slowing down the propagation.

Cherenkov radiation

Cherenkov radiation occurs when a charged particle traveling near the speed of light (in a vacuum) enters a material. Radiation is emitted as the particle slows down to a speed less than the speed of light in that material. Note that a particle traveling with a constant velocity does not normally emit radiation; it does so only when the speed of the particle exceeds the speed of light in that material.
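
The condition can be written as a one-line check: with the particle speed expressed as a fraction beta of c, and the speed of light in the material as c / n, radiation is emitted when beta > 1 / n. The sketch below also converts that threshold speed into a kinetic energy using the usual relativistic relation. The refractive index (1.33, roughly water) and the choice of an electron are assumptions made for the example.

```python
import math

ELECTRON_REST_ENERGY_MEV = 0.511  # electron rest energy, MeV


def emits_cherenkov(beta, n):
    """True if a charged particle with speed beta * c exceeds c / n in the material."""
    return beta > 1.0 / n


def threshold_kinetic_energy(rest_energy, n):
    """Kinetic energy at which the particle's speed equals c / n."""
    beta = 1.0 / n
    gamma = 1.0 / math.sqrt(1.0 - beta ** 2)
    return (gamma - 1.0) * rest_energy


n_water = 1.33  # assumed refractive index, roughly water
print(emits_cherenkov(0.99, n_water))   # speed above c / n
print(emits_cherenkov(0.50, n_water))   # speed below c / n
print(round(threshold_kinetic_energy(ELECTRON_REST_ENERGY_MEV, n_water), 3), "MeV")
```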

Thinking about the above and Cherenkov radiation leads to this story: The Cherenkov particle is traveling near the speed of light in vacuum (c) and approaches a material where the speed of light is slower (v). Zooming way in to look at the particle, so that it is well separated in our field of view from the microscopic charges of the material, we see electric and magnetic fields behaving as they do in a vacuum in the immediate vicinity. However, further out, we see these electric and magnetic fields interacting with the microscopic charges of the material. The electric and magnetic fields do not (approximately) propagate beyond the nearby microscopic charges until the microscopic charges have time to re-arrange / respond (this is from the discussion above of the difference between ε and εo). In fact, the rearrangement is so slow that once it is within the material the Cherenkov particle is going to catch up to the electric / magnetic fields within the material. The Cherenkov particle is now "driving" in a concerted manner the electric / magnetic fields - leading the attack from the position of the van / wedge!

Consider in contrast a regular, non-Cherenkov particle, traveling at less than the speed of light in the material (v). The electric / magnetic fields in the material from this particle's motion propagate faster than the particle is traveling, so they outpace the particle. The microscopic charges have time to "equilibrate" / rearrange around the particle's motion. The difference here is that as some microscopic charges get pushed one way, others fill in the opposite way. There is no uniform motion of the microscopic charges, and hence no net emission of radiation.

Application to faster than light motion in a vacuum

Thanks to Joel for pointing me towards this via discussion about the incident where, at the OPERA experiment, they thought they had observed neutrinos traveling faster than the speed of light. A key paper (provided by Joel) was a discussion about how Cherenkov radiation should cause any neutrinos traveling faster than the speed of light to emit radiation (and/or particles, electron-positron pairs) and thus lose energy rapidly. But that mechanism then allows for supra-luminal velocities - and leads me to imagine stories like the above applied to vacuum conditions. A particle traveling faster than c has local, microscopic fields that propagate faster than c, and when they spread further out from the particle interact with the vacuum fields to cause Cherenkov radiation? What would the scale of these microscopic fields be? Planck? Some other characteristic wavelength of the vacuum radiation? Interesting to think about.

Sunday, April 8, 2012

Photons and their relation to the waves in the electric / magnetic fields

When I was in college and I first heard about the concept of photons in physics, I initially guessed that a photon would correspond to the field (electric / magnetic) between 2 nodes of the wave:

I was told this was not correct, and I left it at that, but recently I figured I would investigate why it was wrong. That is what this post is about.

The energy of a photon is given by:

E = h * ν

h is Planck's constant 6.626e-34 [J*s]

ν is frequency [Hz]

I looked up the energy carried by electromagnetic waves in my go-to book: "Introduction to Electrodynamics" by David J. Griffiths:

S = c * εo * Eo^2 * cos^2(k*z - ω*t + δ)

c is the speed of light

εo is the permittivity of space

Eo is the amplitude (maximum magnitude) of the electric field

cos^2 is cosine squared

k is the wavenumber, defined by the relationship between frequency and the speed of light (k = ω / c)

ω is the angular frequency of the radiation (ω = 2 * π * ν)

δ is the phase

We can simplify this by assuming z = 0, δ = 0 - this just says we are looking at what happens at z = 0, and there is no phase offset.

S = c * εo * Eo^2 * cos^2(ω*t)

This equation defines the energy per unit time, per unit area. So for the above, we would choose as our unit of time one cycle, or one half cycle of the wave. But that still leaves the problem of the area. Also, there is no classical restriction on the magnitude of the electric field (Eo). The question then becomes, for a given area, is it possible to have Eo be so low that the energy of the photon spans more than 1 cycle? There is nothing in the equations to prevent this. Is there experimental evidence of it?
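
The question can be made concrete with a small calculation. The sketch below takes the energy flux with the field amplitude squared (c * εo * Eo^2 * cos^2, as in Griffiths), averages it over a cycle (the average of cos^2 is 1/2), and divides the energy crossing a chosen area in one cycle by h * ν. The detector area and the field amplitudes are assumptions picked for illustration.

```python
EPSILON_0 = 8.8541878128e-12  # F/m
C = 299_792_458.0             # m/s
H = 6.626e-34                 # J*s, as quoted above


def photons_per_cycle(e0, area, frequency):
    """Average number of photons crossing `area` during one cycle of the wave.

    Uses the cycle-averaged energy flux 0.5 * c * epsilon_0 * E0**2.
    """
    mean_flux = 0.5 * C * EPSILON_0 * e0 ** 2        # W / m^2
    energy_per_cycle = mean_flux * area / frequency  # J crossing the area per cycle
    return energy_per_cycle / (H * frequency)


frequency = C / 700e-9   # ~700 nm red light
area = 1e-6              # assumed 1 mm^2 detector area
for e0 in (1.0, 1e-3, 1e-6):   # field amplitudes in V/m, chosen for illustration
    print(e0, photons_per_cycle(e0, area, frequency))
```

Whenever the result is below 1, the energy of a single photon corresponds to more than one cycle of the wave crossing that area, which is exactly the situation the classical equations do not forbid.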

This essentially comes down to "single photon" experiments - experiments in which photons are measured one at a time. I start by reading the wikipedia entry on the double slit experiment:

http://en.wikipedia.org/wiki/Double-slit_experiment

~~and will post my thoughts of this and other reading separately.~~

Update:

Single photon experiments with red photons from a He-Ne laser are not too hard to do:

http://ophelia.princeton.edu/~page/single_photon.html

The separation between individual photons is 2 km, which is much longer than the wavelength of the radiation (~700 nm). Therefore, given the above framework, the photon would be spanning billions of nodes!!
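
The numbers quoted above can be turned into a quick estimate. The sketch below uses only the 2 km separation and the ~700 nm wavelength, plus E = h * ν. The photon rate and beam power it prints are inferences from those two numbers, not values taken from the linked page.

```python
C = 299_792_458.0   # speed of light, m/s
H = 6.626e-34       # Planck's constant, J*s

separation = 2000.0      # m, spacing between photons quoted above
wavelength = 700e-9      # m, approximate wavelength quoted above

wavelengths_per_photon = separation / wavelength
photon_rate = C / separation                 # photons per second passing a point
photon_energy = H * C / wavelength           # J, from E = h * nu with nu = c / wavelength
beam_power = photon_rate * photon_energy     # W

print(f"{wavelengths_per_photon:.2e} wavelengths between photons")
print(f"{photon_rate:.2e} photons/s, {photon_energy:.2e} J each, {beam_power:.2e} W")
```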

Saturday, April 7, 2012

Moving faster than the speed of light

I've been thinking about physics a lot lately, and I'm starting to jot things down, so I don't keep going over the same ground again but also to help me iron out the logical inconsistencies that can creep in when you do a problem in Caput. A lot of this is really me just thinking about physics that I've read, and doing thought experiments so I can understand it.

This post is thoughts about what would happen if you moved faster than the speed of light. Start with these premises:

The only way we know about a particle (anything really) is the effect / force that particle exerts on other particles.

You push a block with your hand: the electric / magnetic forces from the electrons in the atoms in the proteins / molecules in the cells in your hand interact with the electrons in the atoms in the molecules (cellulose) in the block of wood
The above example is for electricity and magnetism, but (I've read) applies equally well to other, more exotic forces - e.g. the strong nuclear force between quarks in an atom's nucleus

The forces between particles can be represented as fields. Fields are vectors that exist through space that indicate the force (magnitude, direction) that a "test" particle would experience if it were at that location

Imagine 2 charged particles. From Coulomb's law, we can calculate the force between them. Or, for each particle we could determine the field it generates throughout space. Then, the force on each particle is determined by the field generated by the other particle.
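
The equivalence of the two descriptions is easy to check numerically. The sketch below computes the force directly from Coulomb's law and again as charge times field; the charges and separation are values picked for illustration.

```python
COULOMB_CONSTANT = 8.9875517923e9  # N*m^2/C^2, 1 / (4 * pi * epsilon_0)


def coulomb_force(q1, q2, r):
    """Force between two point charges at separation r (positive = repulsive)."""
    return COULOMB_CONSTANT * q1 * q2 / r ** 2


def field_of_charge(q, r):
    """Electric field strength at distance r from a point charge q."""
    return COULOMB_CONSTANT * q / r ** 2


q1, q2 = 1e-9, 2e-9   # charges in coulombs, chosen for illustration
r = 0.05              # separation in metres

direct = coulomb_force(q1, q2, r)
via_field = q2 * field_of_charge(q1, r)   # force on q2 from the field of q1
print(direct, via_field)

# Ratio of the force at half the separation to the original force.
print(coulomb_force(q1, q2, r / 2) / direct)
```

The last line illustrates the inverse-square dependence: moving the particles closer together increases both the field and the force.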

Movement of particles causes changes in the fields

As 2 particles get closer together, the force they exert on each other increases. Similarly, the field strength increases.

The changes in the fields propagate at the speed of light

The above might sound crazy, but it is well established physics, with tons of experimental evidence. Given the above, it is almost nonsensical to talk about a particle moving faster than the speed of light. That is somewhat expected - the above description of reality is based on the tenet that nothing travels faster than light. But the exercise of investigating what would happen if something moved faster than light helps me understand the relationships. So, 2 scenarios to imagine:

Particle approaches at faster than light

The particle will arrive at a location before the effect of the particle being at the location does. This is just logically inconsistent.

Particle moves away faster than light

This situation is harder to rule out. As the particle recedes, it is not arriving before its effect. The problem with this one occurs for two situations I can think of:

1. Imagine another particle, chasing this one. The "effective" location, based on the fields, is only moving at the speed of light. In this case, the particle has effectively "disappeared". The chasing particle sees only the location represented by the field

2. Imagine instead of a single particle, an atom moving faster than the speed of light. Background: for a stationary atom emitting radiation, the frequency is intrinsic to the motion of the oscillation of the electron(s) within the atom. The radiation, regardless of the relative velocity between the emitting atom and the observer, propagates at the speed of light. The wavelength is determined by the frequency and the speed of light.

Now, for the atom moving faster than the speed of light: Take the period of oscillation, imagine the first cycle has occurred. Now, in this period of time, the atom has traveled a distance greater than the wavelength of the radiation, and a new cycle occurs. So the separation in peaks / troughs between the first and second cycle is greater than the wavelength (as it would be defined for regular sub-luminal speeds). Furthermore, for the third cycle, the discrepancy is even greater. So, even though the atom is travelling at constant velocity, the radiation is continuously increasingly red-shifted (chirped down!). Effectively, as time goes on, the emission of radiation is red-shifted until it would disappear completely. Now, this description is discrete, but it could be made continuous.

Why would the above be impossible or inconsistent? Well, the particle, in this case, has effectively disappeared from the universe, since internally it is emitting radiation, but this vanishes / does not appear anywhere else.

The reverse of this is also possible to imagine, in which an atom emitting radiation approaches at faster than the speed of light, and the radiation is continuously increasingly blue shifted. In this case, leaving aside the issue from above of the particle arriving before its effect, the radiation observed would be increasingly blue shifted over time (chirped up!). Where is the increased power / energy coming from? Again, the internal state of the atom is disconnected from the rest of the universe.

