Non-negative Matrix Factorization

Posted by marco
Tue, 08 Jul 2008 17:00:00 GMT

Here it is a small ruby script that calculate the non-negative matrix factorization of a given matrix. I tried to mantain the notation consistent with the one used in the paper of Lee and Seung.

The script requires the GSL library.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113

#!/usr/bin/ruby


require("gsl")



def nmf(v, col, thresh)
   r, c = v.shape

   # w * h = v
   w = GSL::Matrix.alloc(r, col)
   h = GSL::Matrix.alloc(col, c)
   initrand(w)
   initrand(h)
   dist = thresh
   iter = 0
   while(dist >= thresh)
      # multiplicative update rule
      dist, w, h = update(v, w, h)
      puts dist if (iter%50 == 0)
      iter += 1
   end   
   puts "Ended in #{iter} iteration(s) with dist=#{dist}"
   return w, h
end


def initrand(m)
   r, c = m.shape
   0.upto(r-1) { |i|
       0.upto(c-1) { |j|
          m[i,j] = rand()
       }
   }
end

def update(v, w, h)
   # choose an update rule
   # v, w, h are GSL::Matrix objects
   dist, w, h = update_eu(v, w, h)
   return dist, w, h
end

def update_eu(v, w, h)
   # multiplicative update rule
   # for minimizing euclidean distance
   # v, w, h are GSL::Matrix objects
   dist = dist_eu(v, w*h)
   wt = w.transpose
   ht =  h.transpose
   h = h.mul_elements(wt * v).div_elements(wt * w * h)
   w = w.mul_elements(v * ht).div_elements(w * h * ht)
   return dist, w, h
end


def update_kl(v, w, h)
   # multiplicative update rule
   # for minimizing divergence
   # v, w, h are GSL::Matrix objects
   dist = dist_kl(v, w*h)
   wt = w.transpose
   ht =  h.transpose
   num = v.div_elements(w * h)
   rh, ch = h.shape
   rw, cw = w.shape
   0.upto(rh-1) { |i|
       0.upto(ch-1) { |j|
          h[i, j] *= w.col(i).mul(num.col(j)).sum / w.col(i).sum
       }
   }
   0.upto(rw-1) { |i|
       0.upto(cw-1) { |j|
          w[i, j] *= h.row(j).mul(num.row(i)).sum / h.row(j).sum
       }
   }
   return dist, w, h
end



# Euclidean distance
def dist_eu(a, b)
   # a,b are GSL::Matrix objects
   dist = 0
   (a-b).collect { |d| dist += d * d }
   dist
end

# Kullback-Leibler divergence
def dist_kl(a, b)
   # a, b are GSL::Matrix objects
   dist = 0
   r, c = a.shape
   0.upto(r-1) { |i|
       0.upto(c-1) { |j|
          if (b[i,j] == 0) 
             dist = 2**32
             break
          end
          dist += a[i,j] * Math::log(a[i,j]/b[i,j]) - a[i,j] + b[i,j]
       }
   }
   dist
end

v = GSL::Matrix[[29, 31], [37, 41]]
w, h = nmf(v, 3, 10**(-10))
puts w
puts h
puts w*h

Tanimoto Coefficient

Posted by marco
Wed, 02 Jul 2008 16:40:00 GMT

A small Ruby snippet to calculate the Tanimoto coefficient (aka extended Jaccard coefficient) of two real vectors.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

#!/usr/bin/ruby

class Array
   def sum
      inject( 0 ) { |sum,x| sum+x }
   end
   def sum_square
      inject( 0 ) { |sum,x| sum+x*x }
   end
   def *(other) # dot_product
      ret = []
      return nil if !other.is_a? Array || size != other.size
      self.each_with_index {|x, i| ret << x * other[i]}
      ret.sum
   end
end

def tanimoto(a, b)
   dot = (a*b)
   den = a.sum_square + b.sum_square - dot
   dot.to_f/den.to_f
end

# puts tanimoto([1,2,2],[3,3,1])


Pearson Correlation Coefficient

Posted by marco
Tue, 01 Jul 2008 12:26:00 GMT

A small Ruby snippet to calculate the Pearson product-moment correlation coefficient:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36

#!/usr/bin/ruby

class Array 
   def sum
      inject( 0 ) { |sum,x| sum+x }
   end
   def sum_square
      inject( 0 ) { |sum,x| sum+x*x }
   end
   def mean
      sum / size
   end
   def *(other)
      ret = []
      return nil if !other.is_a? Array || size != other.size
      self.each_with_index {|x, i| ret << x * other[i]}
      ret
   end
end


x = Array.new(100) {|i| rand(10) }
y = Array.new(100) {|i| rand(20) }


sumx = x.sum
sumy  = y.sum
num = (x.size * (x*y).sum) - (sumx * sumy)
den = Math.sqrt((x.size * x.sum_square) - sumx * sumx) * Math.sqrt((y.size * y.sum_square) - sumy * sumy)
r = num /den

puts "X= [#{x.join(', ')}]"
puts "Y= [#{y.join(', ')}]"
puts "r= #{r}"

ζ(s)

Posted by marco
Wed, 18 Jun 2008 13:19:00 GMT

Such a terse and strong abstract, and a really short paper, for an hypothesis that lasts unproved for a century. We shall see if the proof is correct.

Ruby Fiber Ring Benchmark

Posted by marco
Thu, 22 May 2008 17:51:00 GMT

I stole an exercise from the Erlang world: the ring benchmark

I used the exercise as a way to get acquainted with a new class in Ruby, Fiber (starting from the version 1.9). Fibers are a way to implement asymmetric coroutines in Ruby,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85

#!/usr/bin/ruby1.9

require 'fiber'
require 'benchmark'

class Ring
   attr_reader :id
   attr_accessor :attach 

   def initialize(id)
      @id = id
      @fiber = Fiber.new do
         pass_message
      end
   end

   def |(other)
      other.attach = self if !other.nil?
      other
   end

   def resume
      @fiber.resume
    end

   def pass_message
      while message = message_in
         message_out(message)      
      end
   end

   def message_in
      @attach.resume if !@attach.nil?
   end

   def message_out(message)
      Fiber.yield(message)
   end

end

class RingStart < Ring
   attr_accessor :message
   def initialize(n, m, message)
      @m = m
      @message = message
      super(n)
   end
   
   def pass_message 
      loop { message_out(@message) }
   end

end


def create_chain_r(i, chain)
   # recursive version
   return chain if i<=0
   r = chain.nil? ? Ring.new(i) :  chain | Ring.new(i)
   create_chain(i-1, r)
end

def create_chain(n, chain)
   # loop version
   # needed to avoid stack overflow for high n
   n.downto(0) {
      chain = chain | Ring.new(n)
   }
   chain
end



n=ARGV[0].to_i
m=ARGV[1].to_i
mess=:hello
tm  = Benchmark.measure {
   ringu = RingStart.new(0, m, mess)
   chain = create_chain(n, ringu)
   m.times { ringu.message = chain.resume }
}.format("%10.6r\n").gsub!(/\(|\)/, "")

puts "#{n}, #{m}, #{tm}"
The comparison with the equivalent Erlang program is of course unfair, for one of the Erlang specialty is the spawning of cooperative processes. Anyway, here are the results:    
NMExecution Time
Ruby [ms]Erlang [ms]
10 10 1 0
10 20 2 0
10 50 4 10
10 100 10 0
10 200 18 0
10 500 42 0
10 1000 82 10
100 10 15 0
100 20 23 10
100 50 60 0
100 100 99 10
100 200 186 20
100 500 433 50
100 1000 805 90
1000 10 320 10
1000 20 423 20
1000 50 809 60
1000 100 1479 130
1000 200 2803 250
1000 500 6914 600
1000 1000 15410 1190
10000 10 16168 190
10000 20 17491 320
10000 50 21348 710
10000 100 28314 1480
10000 200 41547 2860
10000 500 81166 7200
10000 1000 147570 14190

Oil price

Posted by marco
Thu, 22 May 2008 00:13:00 GMT

I’d like to reiterate a past meme. This time I restrict the meme to one single prediction: please tell me wether you think the crude oil price will pass the 150$ per barrell in the very near future, and, if you think that of it’s only matter of weeks, dare to say the exact day. I venture myself to bespeak the 30th of June, 2008.

Carlo, Alx, what do you think?

Caching Mephisto with lighttpd and mod_magnet

Posted by marco
Thu, 15 May 2008 00:15:00 GMT

This blog uses Mephisto, a rails application, and Thin as backend webserver. The frontend webserver is Lighttpd.

While Mephisto has a good caching system, producing a static html file for every visited page, the file is still served via Thin, reducing the speed and the scalability of the system and introducing an unnecessary step.

Lighttpd is very good at serving static files, thus I searched a way to serve a file from the cache directory of Mephisto, if present; I found a hack using mod_magnet.

Hereafter I wrote the relevant snippets of configuration files and how to put every piece in the right position.

Note: the same procedure should apply if you use Mongrel instead of Thin, for the relevant point is to bypass the backend rails webserver and serve all files (whether they exist) from the Mephisto cache.

First I configure Lighttpd to proxy to Thin for the rails application. I assume Thin (or Mongrel) is listening on port 3000 on two hosts.

  HTTP[''host"] == "virtualhost.example.org" {
        magnet.attract-physical-path-to = ( "/var/www/cache_mephisto.lua" ) 
        proxy.balance = "hash"
        proxy.server  = ( "" => ( ( "host" => "10.0.0.1", "port" =>  3000 ),
                                ( "host" => "10.0.0.2", "port" =>  3000 ) ) )
  }

The magnet.attract-physical-path-to = ( "/var/www/cache_mephisto.lua" ) is the magic trick. cache_mephisto.lua is a script written in lua programming language. The whole script is:

  local dir =  "/path/to/mephisto/public/cache/" .. lighty.request["Host"]

  if ( lighty.env["request.uri"] == nil or lighty.env["request.uri"] == "/") then
    file = dir .. "/index.html"
  else
    file = dir .. lighty.env["request.uri"] ..  ".html"
  end

  if lighty.stat(file) then
  lighty.header["Content-Type"] = "text/html"
  lighty.content = { { filename = file } }
  return 200
  end

 return lighty.RESTART_REQUEST

Inside the script you should adapt the variable dir to your Mephisto installation path. I need to append the virtual host name lighty.request["Host"] to the directory, for I run a multisite Mephisto installation; if you use a single site mephisto installation, probably you need to remove this append.

Now for some benchmarks:

  • serving the file via Thin:

       Requests per second:    1.40 [#/sec] (mean)
       Time per request:       716.007 [ms] (mean)
    
  • serving the file from the cache

       Requests per second:    11.46 [#/sec] (mean)
       Time per request:       87.247 [ms] (mean)
    

Almost a tenfold improvement!

Future works: serve from the cache the image files.

US Economy

Posted by marco
Sun, 20 Apr 2008 22:06:00 GMT

A thoughtful article: why Us economy is collapsing

A really interesting movie: money as debt

Postgresql Graph

Posted by marco
Thu, 13 Mar 2008 12:19:00 GMT

I wrote a small tool, postgresqlgraph, to monitor and plot the statistics of a postgresql database, using rrdtool.

The tool is able to monitor every database created inside a postgresql installation.

The parameters plotted for each database in the graph are:

  • numbackends: number of backends:
  • xact_commit: transactions committed;
  • xact_rollback: transactions rolled back;
  • blks_read: blocks read from the disk;:
  • blks_hit: blocks hit in the cache;
  • tup_returned: tuples returned (only in postgresql 8.3)
  • tup_fetched: tuples fetched (only in postgresql 8.3)
  • tup_inserted: tuples inserted (only in postgresql 8.3)
  • tup_updated: tuples updated (only in postgresql 8.3)
  • tup_deleted: deleted (only in postgresql 8.3)

For each parameter the graph shows the maximum and average values, and plots the derivative (rate change).

The software is published under GPLv3 .

A demo is available here.

Please feel free to submit me bugs, critics, improvements, or tell me if you find the tool of any worthwhileness.

SNAFU

Posted by marco
Tue, 27 Nov 2007 17:00:00 GMT

I spent a frustrating week of problem solving (or at least trying to…) of obscure integration issues among different programs, like trying to get zabbix to talk with a jabber server (the solution was found in IPV6), reading unuseful logs to debug problems (what the hell do you mean with “connection failed”? To what?) , recompiling software with arcane options, rebooting servers with the swap space full (apache, what are you doing?), or trying to stop a server that continues to reboot itself at boot (Xen, please tell me why…), or fighting with buggy PHP frontend, and, last but not least, investigate why a monitoring software just stopped working.

So I come to wonder if it exists a software that “just works”. If you ever happen to read this rambling and know such a software, please, please tell me. In the meanwhile I coined a law:

Dela’s Law of software: there is no software that “just works”
Corollary to the Dela’s Law of software: if a software seems to “just work”, after some times it will stop to work for an obscure, unknown and very difficult to find reason.