αρσ (alpha rho sigma)

Thursday, September 3, 2009

Phantom / Private Opera

One can almost take for granted a feature named “Private browsing” in Firefox or “Incognito” in Google Chrome and their similar counterparts in other browsers such as IE8 and Safari. It was a surprise to me that Opera 10 still doesn't have this feature. I searched for it in the Opera Forums, but I could find only an overcomplicated script that went like this:

cp -r ~/.opera ~/.opera.bak
opera
rm -rf ~/.opera
mv ~/.opera.bak ~/.opera

This way, any changes that occur during the private session will be deleted and the ~/.opera directory will be restored to how it was. It works fine, but I don't feel like copying 40MB around every time I want to browse in private mode.

Another option would be to disable cookies, history, disk cache, etc. The problem with this approach is that there isn't a quick way to go through these options and you have to remember to set and unset them every time you want to be in private mode.

The solution I'm using is creating a new directory, named ~/.opera/phantom, and changing Opera's configuration directory with the -pd (“profile directory”) flag. The actual name of the directory (and where it is) doesn't matter much. I put it inside ~/.opera so everything stays together, but it has no connection to the profile being used in ~/.opera. One could also use something such as ~/.foo. So after creating the directory you want to use, create a little shell script:

#!/bin/sh
opera -pd ~/.opera/phantom

Give it a name easy to remember (I'm using operap) and add executable permission to it: chmod +x operap.

Now start operap and configure the options so it won't accept cookies, save visited websites, save cache to disk, etc. Also, don't forget to change the startup page -- by default it's “open last time tabs”, so change it to show Speed Dial, or something like that.
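If you'd rather not keep a phantom profile around at all, the same -pd flag can point at a temporary directory that is deleted when Opera exits. The following is only a sketch in Python, assuming the opera binary is in your PATH and that -pd behaves as described above; opera_command and run_throwaway_session are names made up for this illustration, and a brand-new profile will start with Opera's default settings.

```python
import shutil
import subprocess
import tempfile


def opera_command(profile_dir, binary="opera"):
    """Build the argument list that starts Opera on an alternate profile directory."""
    return [binary, "-pd", profile_dir]


def run_throwaway_session(binary="opera"):
    """Start Opera on a fresh temporary profile and delete the profile afterwards."""
    profile_dir = tempfile.mkdtemp(prefix="opera-phantom-")
    try:
        # Blocks until the browser process exits.
        return subprocess.call(opera_command(profile_dir, binary))
    finally:
        # Whatever the session wrote (cookies, cache, history) goes away here.
        shutil.rmtree(profile_dir, ignore_errors=True)
```

Calling run_throwaway_session() from a small script would then play the role of operap, at the cost of reconfiguring nothing and keeping nothing between sessions.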

Note: “Private mode” wasn't made to make you safer on the Internet, but to prevent people who have physical access to your computer from seeing what you've been browsing. Suppose you and your wife/husband share the same computer and you want to surprise them. If the other person saw your browser history, they would be able to discover the surprise and it wouldn't be fun anymore. OK, I'm lying: the main use of private browsing is porn.

Friday, August 28, 2009

Benchmark: XFS vs ext4

I guess by now everybody has read about the "files truncated to 0" bug of ext4. As a precaution, I turned on the nodelalloc option in fstab just after I installed Arch Linux on my recently bought notebook. Everything is running fine, safety and speedwise (especially considering my partitions are encrypted with LUKS and my password is ********... oops!), but I can't compare it to anything, as I've never used any other filesystem on this computer, and when it arrived I was too anxious to run benchmarks and more benchmarks to decide what filesystem to use.

There is one weird thing about it, though. After a certain number of mounts, both my root and my home partition get checked. While this isn't strange at all, the information this process prints on my screen is: both filesystems are around 20% non-contiguous. To be honest, I'm being a dick here (I'm sorry if your name is Dick); there is no noticeable difference in performance that I can tell, but still, 20% is a big number and I thought that defragging these partitions would be a good thing to do.

The reason I say would is there is no official defragger for ext4. What now, Jose?

I remembered the days I used to use XFS and it had an online defrag. I also remember it was terribly slow with small files, so updating my system with pacman was frigging slow. I also remember that, since then, I've read a lot about filesystems and optimizations and that I've never tried using noatime with XFS. I searched some pages about optimizing XFS and it's incredible how much better it can get with some simple options during its creation and some other options that should go in fstab.

That being said, I'll try it in a spare partition I have, and maybe change my system to use XFS (hopefully, I'll do it without having to reinstall anything). Wish me luck and let LVM be with you.

Benchmarks

So I decided to make some benchmarks before changing my partitions. I , my . Don't ask for benchmarks of other filesystems, I won't do it. I chose to test only ext4 and XFS, as Reiser3 and ext3 are dated filesystems (my backup is ext3, though) and other benchmarks showed that JFS doesn't have the performance I expect. It may use less CPU time, but whatever. BtrFS is still being developed and other filesystems don't seem to be ready, either.

OK... The actual reason I didn't want to make a lot of tests is that I'll use my only computer to do them, but I want these tests to be honest, so I'll run them with the minimal number of processes running. This isn't per se a problem, but what am I supposed to do during the tests? Shave? So I decided to do only three tests: ext4 with delayed allocation (without nodelalloc), default XFS, optimized XFS. The reason I didn't optimize ext4 is that I didn't find any nice text about it. There are only small modifications, such as using noatime, but this applies to all filesystems.

I have a spare partition here of 15GB that I'll use for the tests. My home partition is a lot bigger than that, but that's what I have. Besides, my root partition is a little smaller than that, so although the tests won't be a good representation of my home partition, they'll be a very good one of my root partition. This test partition won't be using LVM or encryption, and these are the commands I've used:

mkfs.ext4 /dev/sda3
mkfs.xfs /dev/sda3
mkfs.xfs -l lazy-count=1,size=128m /dev/sda3

The ext4 partition uses the defaults of Arch Linux:

[defaults]
    base_features = sparse_super,filetype,resize_inode,dir_index,ext_attr
    blocksize = 4096
    inode_size = 256
    inode_ratio = 16384

[fs_types]
    ext4 = {
        features = has_journal,extents,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
        inode_size = 256
    }

The optimized XFS has two different options related to logs: by default, XFS uses a log of 22MB, but its performance increases with bigger logs, so we use 128MB. Also, XFS tries to keep counters in superblocks always up to date, but this information can be retrieved only when necessary. Turning on lazy-count, we avoid some disk writes.

All three filesystems will be mounted with noatime. Ext4 and default XFS won't have any other option specified beyond this one. The optimized XFS will be mounted with two other options: logbufs=8, which increases the number of log buffers from 2 to 8, and logbsize=256k, which increases the log buffer size from 32KB to 256KB. This increases memory usage, of course, but I have 3GB. It won't be 2MB that will make me run out of memory.
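Where the 2MB comes from is simple arithmetic: number of log buffers times the size of each buffer. A tiny Python sketch of that calculation, using only the numbers quoted above (log_buffer_bytes is just an illustrative helper name):

```python
KIB = 1024


def log_buffer_bytes(logbufs, logbsize_kib):
    """Memory used by the in-memory log buffers: count times size of each."""
    return logbufs * logbsize_kib * KIB


default_bytes = log_buffer_bytes(2, 32)    # the defaults quoted in the text
tuned_bytes = log_buffer_bytes(8, 256)     # logbufs=8, logbsize=256k

print(default_bytes // KIB, "KB with the defaults")
print(tuned_bytes // (KIB * KIB), "MB with the tuned options")
```

So the tuned mount trades 64KB of buffers for 2MB, which is nothing on a machine with 3GB.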

Note: I've decided to benchmark only the optimized XFS. When I finished the first set of benchmarks, I was already pretty bored.

Kernel
Every time I see a benchmark of file systems, there is a test like this. So I'll do it, too! I first extract the contents from the kernel .tar.bz2 and then I copy the folder to another place in the same partition and then rm everything.

#!/bin/sh
cp /home/andre/code/linux-2.6.30.5.tar.bz2 /media/bench/
cd /media/bench
tar -xjf linux-2.6.30.5.tar.bz2
cp -R linux-2.6.30.5 lunix
rm -rf linux-2.6.30.5{,.tar.bz2} lunix

Pacman
Pacman is the package manager of Arch Linux. It handles a lot of small files, so I guess this is a pretty good test. I run pacman -Syy -b /new/partition, which generates a database in a different folder from the default, and then I search it for three packages: kernel26, gimp and ncmpcpp. There is a problem with this test, though: the first part of it depends on the network, too. Statistically, running this test a few times should minimize the differences.

#!/bin/sh
pacman -Syy -b /media/bench
pacman -Si kernel26 -b /media/bench
pacman -Si gimp -b /media/bench
pacman -Si ncmpcpp -b /media/bench

With you... The Beatles!
I decided to use The Beatles' discography as a test for a big number of files of medium size (~4MB). My musical collection is much bigger than that, but I guess we can get some good information from this.

#!/bin/sh
cp -R /home/andre/media/music/The\ Beatles /media/bench/Beatles
cp -R /media/bench/Beatles /media/bench/Rutles
rm -r /media/bench/Beatles /media/bench/Rutles

Moving a disk around
I guess the biggest file I have in my computer is a virtual disk of a virtual machine, so I use it to test the performance of the filesystem with big files (~2.2GB).

#!/bin/sh
cp /home/andre/.local/.VirtualBox/HardDisks/arch.vdi /media/bench
cp /home/andre/.local/.VirtualBox/HardDisks/lose32.vdi /media/bench
cp /media/bench/arch.vdi /media/bench/arch2.vdi
cp /media/bench/lose32.vdi /media/bench/lose64.vdi
rm /media/bench/{arch,arch2,lose32,lose64}.vdi

Results

All the scripts above were run 5 times, then I calculated the mean and the sanitized mean (the mean without the highest and the lowest value). Actually, I used a python script that can be found at the end of this article.
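To see why the sanitized mean is worth computing, here is a small Python sketch (not the script actually used for the benchmarks) applied to the five ext4 pacman timings from the results, where the first run was much slower than the rest. The function names are illustrative only.

```python
import math


def mean(values):
    """Plain arithmetic mean."""
    return math.fsum(values) / len(values)


def sanitized_mean(values):
    """Mean after dropping the single highest and the single lowest value."""
    if len(values) < 3:
        raise ValueError("need at least three values to drop two of them")
    trimmed = sorted(values)[1:-1]   # sorted() leaves the caller's list untouched
    return math.fsum(trimmed) / len(trimmed)


# ext4 timings (seconds) of the pacman test, copied from the results.
pacman_ext4 = [22.139, 5.6207, 4.98342, 5.74287, 5.2469]

print("mean: %g" % mean(pacman_ext4))
print("sanitized mean: %g" % sanitized_mean(pacman_ext4))
```

One slow run drags the plain mean up to almost 9 seconds, while the sanitized mean stays around 5.5 seconds, close to the four typical runs.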

To be honest, I thought XFS would perform much better. With medium and large files, XFS got really close to ext4 (the differences are within the noise between runs), but it was never clearly faster than ext4, and it was more than three times slower when handling the kernel files.

After these tests, I'll keep ext4 for a while longer, as there's probably no match for it.

=KERNEL
./time-it 5 ./kernel.sh

ext4
52.2324
42.7818
42.7238
51.5118
42.6375
mean: 46.3774
sanatized mean: 45.6725

xfs-opt
147.237
164.076
188.027
148.23
134.843
mean: 156.483
sanatized mean: 153.181

=PACMAN
./time-it 5 ./pacman.sh

ext4
22.139
5.6207
4.98342
5.74287
5.2469
mean: 8.74658
sanatized mean: 5.53682

xfs-opt
8.51362
14.2627
15.3344
14.9117
14.0883
mean: 13.4222
sanatized mean: 14.4209

=BEATLES
./time-it 5 ./beatles.sh

ext4
292.881
306.477
297.082
303.119
301.018
mean: 300.115
sanatized mean: 300.406

xfs-opt
294.875
301.987
297.663
304.623
301.512
mean: 300.132
sanatized mean: 300.387

=VIRTUAL
./time-it 5 ./virtual.sh

ext4
435.735
432.042
439.759
446.086
502.91
mean: 451.306
sanatized mean: 440.527
* I opened firefox during this one, so maybe it would be better to consider only the 4 first times.
mean: 438.406

xfs-opt
444.722
441.038
432.156
432.414
447.277
mean: 439.521
sanatized mean: 439.391

=USAGE
ext4     /dev/sda3 11535376 159680 10789728 2% /media/bench
xfs-opt  /dev/sda3 11588344   4256 11584088 1% /media/bench

=NOTES
ext4: with both BEATLES and VIRTUAL, the system got really slow.
xfs-opt: the same.
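To put the numbers side by side, this short Python sketch divides the sanitized mean of optimized XFS by that of ext4 for each test. The values are copied from the results above; the dictionary layout and the slowdown helper are only illustrative.

```python
# Sanitized means in seconds, copied from the results above.
sanitized = {
    "kernel":  {"ext4": 45.6725, "xfs-opt": 153.181},
    "pacman":  {"ext4": 5.53682, "xfs-opt": 14.4209},
    "beatles": {"ext4": 300.406, "xfs-opt": 300.387},
    "virtual": {"ext4": 440.527, "xfs-opt": 439.391},
}


def slowdown(test):
    """How many times longer xfs-opt took than ext4 (values above 1 mean XFS was slower)."""
    return sanitized[test]["xfs-opt"] / sanitized[test]["ext4"]


for test in ("kernel", "pacman", "beatles", "virtual"):
    print("%-8s xfs-opt / ext4 = %.2f" % (test, slowdown(test)))
```

The ratio is well above 3 for the kernel test and above 2.5 for pacman, while for the Beatles and virtual disk tests it is within a fraction of a percent of 1.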

time-it.py

#!/usr/bin/env python
# Python 2 script: runs a command a number of times and prints the timings.
import time
import sys
import subprocess
import math

if len(sys.argv) < 3:
    print "time-it.py RUNS COMMAND [ARGS...]"
    exit()

runs = int(sys.argv[1])
command = sys.argv[2:]

def mean(lst):
    global runs
    return (math.fsum(lst) / float(runs))

def san_mean(lst):
    # mean without the highest and the lowest value
    global runs
    lst.sort()
    return (math.fsum(lst[1:-1]) / float(runs-2))

time.sleep(2)

# run the command and record the wall-clock time of each run
count = 0
timing = []
while count < runs:
    t1 = time.time()
    subprocess.call(command)
    t2 = time.time()
    timing.append(t2-t1)
    count += 1
    time.sleep(1)

count = 0
print " ".join(command)
while count < runs:
    print "%g" % timing[count]
    count += 1

print "mean: %g" % mean(timing)
print "sanatized mean: %g" % san_mean(timing)
print

Friday, August 21, 2009

Combinatorics are tricky

One of these days I was helping a friend of mine with a relatively simple Combinatorics problem. It goes like this: using the digits from 0 to 9, how many 4-digit natural numbers can we write? Digits may repeat, but only if they aren't adjacent: 2424 is a valid number, but 2244 isn't.

The way I solved it was something like this:

The first digit may be any digit (except 0, or it would actually be a 3-digit number), so there are 9 possibilities.
The fourth digit may be any digit, so there are 10 possibilities.
The second digit may be any digit, except the first one, so there are 9 possibilities.
The third digit may be any digit, except the second and the fourth, so there are 8 possibilities.

Multiplying it all, we have: 9 × 9 × 8 × 10 = 6480.

According to the book my friend was reading, the answer was 6561, but there wasn't an explanation of why. I suspect the author of the book did something like this:

The first digit may be any digit, except 0.
The second digit may be any digit, except the previous.
The third digit may be any digit, except the previous.
The fourth digit may be any digit, except the previous.

This way of thinking will yield 9^4 = 6561 and it also seems to be correct. Unsure who was right, I wrote a little Haskell program to check the answer: given the numbers from 1000 to 9999, keep only the ones that have no equal adjacent digits and print the length of the list.

module Main
    where

import System.IO

num = [1000..9999]

main = putStrLn $ show $ length $ filter (adjp) num

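adjp n = if fd == sd then False
         else
             if sd == td then False
             else
                 if td == ft then False
                 else True
    where
      fd = (mod n 10)
      sd = (mod (div n 10) 10)
      td = (mod (div n 100) 10)
      ft = (mod (div n 1000) 10)

I don't like the nested ifs and elses, there must be a more elegant way of writing this, but I wrote it quickly and it works, so whatever. If you run it, it will print 6561, meaning that the book was right and I was wrong. The flaw in my solution is in the last step: the third digit has only 8 possibilities when the second and the fourth digits are different. When they are equal, it has 9, and that happens for 9 × 9 = 81 choices of the first, second and fourth digits, so my count is short by exactly 81: 6480 + 81 = 6561.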
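A brute-force cross-check, sketched in Python and independent of the Haskell program: it compares every pair of neighbouring digits, including the first two, and also splits the valid numbers by whether the second and fourth digits are equal, since that is the case where the third digit has 9 options instead of 8. The function and variable names are only illustrative.

```python
def no_adjacent_repeats(n):
    """True when no two neighbouring decimal digits of n are equal."""
    digits = str(n)
    return all(a != b for a, b in zip(digits, digits[1:]))


# All 4-digit numbers without equal adjacent digits.
valid = [n for n in range(1000, 10000) if no_adjacent_repeats(n)]

# Split by whether the second and fourth digits coincide.
outer_equal = [n for n in valid if str(n)[1] == str(n)[3]]
outer_differ = [n for n in valid if str(n)[1] != str(n)[3]]

print(len(valid))         # total count
print(len(outer_equal))   # 81 choices of the other digits x 9 third digits
print(len(outer_differ))  # 729 choices of the other digits x 8 third digits
```

The total is 6561 = 9^4, made of 81 × 9 = 729 numbers whose second and fourth digits are equal and 729 × 8 = 5832 whose second and fourth digits differ.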

Saturday, August 8, 2009

The “Linux is only free if your time has no value” myth

Or if “Linux is only free if your time has no value” then Windows is more expensive than it seems.

My first contact with Linux was around 5 years ago, using a Brazilian distro named Kurumin, meant to be used as a LiveCD and based on Knoppix/Debian. One year later, I decided to install Slackware on my computer, and as I had a small HD at the time, it became my only operating system.

Tired of having to keep track of dependencies and the like, I replaced it with Arch Linux, my distro of choice for the last few years (and for the next years, too). While I am relatively new to Linux (I have never used a kernel from the 2.4 series), I'm used to “hard to use” distros, and I fear no text command.

During this time, I did “waste” my time configuring X, trying different filesystems, partition schemes, window managers... but I don't see it as a waste of time. It was time spent learning something. One may ask “what's the use of learning an OS that almost nobody uses?”, but that wasn't the only thing I've learnt. Throughout these years I've used Linux, I also improved my English and my programming skills, I know more about how computers work (and how they don't), I've started worrying about my privacy online, I've learnt how being a minority sucks, I've learnt to RTFM... Some of these things might not be useful from a professional point of view, but they helped to develop the person that I am today.

When the time to go to college came, I chose to take the Electrical Engineering course. As some of you may know, the Engineering field uses some proprietary software, such as AutoCAD, Matlab and others. I had no choice but to learn how to use them. I thought “I'll get educational licenses from the college and install them in a virtual machine. How bad can it be?”. It turned out to be a lot worse than I thought at first.

First step: Install Windows XP
I didn't measure how long it took to install Windows XP, but assume it's about the same time it takes to install a Linux distro, so installing both Linux and Windows “costs” the same, right? Yes, except for the fact that after installing Linux, you have an office suite installed, an image editor, music and video players, an editor with syntax highlighting, a C compiler, a Python interpreter. Some less ideological distros already come with Mono and Java preinstalled.

When you first install Windows, there's almost nothing there. Now you have some options:

Install the software you need from a CD, and possibly type a key code.
Search and download a freeware version from the internet.
Search and download illegal software and search for cracks on the internet.

Installing from CDs is bad, as you have to insert the CD, click “next” a few times, type the key code, swap CDs... Searching for freeware and downloading it takes some time, and then you have to click next a few times, too. Installing cracked software is even worse, as it involves some risk. How good would it be if you could just select some software from a list and then leave to drink some coffee and let the computer do its own job -- download and install software? That's how installing software on Linux works.

I think it's clear by now that installing Windows is more time-expensive than installing Linux, but this comparison isn't complete. If the Windows installation we're talking about is OEM, some software may have been installed for us. Sometimes even software we don't want comes preinstalled, and removing it is usually cumbersome. On the other hand, if this installation is a “normal” one, now you probably have to install some drivers, but what's your hardware? On Linux, a simple “lspci” describes your computer; if you need more information, use the “-v” flag. On Windows, though, there's no such tool; you have to search for one and install it. Now, with a program such as Everest or similar, we can start searching for drivers.

I'm in your computer, stealing your CPU time
Of course, no sane person would ever download and install something on Windows without running anti-virus software and probably a firewall, too. What this means is that there's one program scanning all your connections while you search the web and another one scanning your files while you install something, so the time it takes to search for the software you need and to install it is longer than it would be if there were no need for that.

A problem arises...
Things worked fine for about two weeks, then suddenly AutoCAD stopped working. It could have been any other software, it just happens that it was AutoCAD. I tried it again, and it crashed again. I rebooted the machine, and it happened again. I'm not saying this kind of thing doesn't happen with Linux, but there I can still try to run the app from a terminal and see if it's segfaulting, or if it's searching for a file that can't be found.

Unaware of what else I could do, I did what any other Windows user would do: reinstalled the program. Now, ask me if it worked. No, it didn't. I decided to try it again, manually deleting some files from the “Documents and Settings” folder, and manually deleting any entry in the registry that could have been left behind. I gave up on the idea of trying to understand what happens in the registry and decided to install a registry cleaner, but now, whenever I try to access a shared folder, explorer simply crashes.

From that, what I've learnt is that Windows is expensive, even if you're using a free educational license.

Notes

Because of laziness, throughout this article I wrote Linux when I meant the GNU/Linux OS. Don't get mad at me, RMS.
I compared Linux with Windows because Macs are rare here. From what I've seen, some of what's written here may be applied to Mac OS X, too.

Thursday, August 6, 2009

Writing more efficiently

I've been thinking of a way to write faster and with fewer movements of my arm. It may seem rather useless, as most of what I write is typed on a keyboard anyway, but my Statistics classes got me thinking another way. Besides, handwriting still has its uses, even when it's not cursive (I haven't used cursive since I was... 11, I guess), so if you're going to write a quick note, why not write it quicker?

One last reason is that it's different. This is quite a personal reason (that's why this is a blog, anyway), but I really like being different. I became aware of this trait of my personality recently; it's rather subconscious.

OK, 'nuff off-topic. I'm not completely satisfied with the system I've developed, but I'll document here my first steps.

First, remember that I'm Brazilian, so when I'm handwriting, I'm writing in Portuguese. I do write a lot in English, but I use the keyboard for that. The reason I'm emphasizing this is that Portuguese has a good balance of vowels and consonants (for those who aren't used to it, it's quite similar to Spanish). The method described here will work with other languages, but if your language isn't as vocalic as Portuguese, it may not work so well.

Keep in mind also that I'm reading Tolkien at the moment (The Silmarillion), so now you'll see where the idea came from.

The method
The idea of using Tengwar as it was created was something that I didn't actually like; I would have to learn a new writing system and then adapt it to Portuguese. It wouldn't be that hard, I remember I did it some time ago, though I forgot most of it, and I don't have the papers anymore, but I'd like something simpler.

My idea was to take the basic idea of Tengwar (consonants with diacritics indicating the vowels) and adapt it to the Roman alphabet. Without even thinking much about it, we can see it has its advantages:

It's more "dense", I can write the same stuff using less space
I don't have to learn new symbols as I would with Tengwar or some writing techniques that use ideograms for common words.
As the diacritics are written above and not beside, I know how much space a word will take by writing only about half of its characters, so I can write a complete sentence in a run and then put the diacritics.

How it works
The rules I (arbitrarily) decided on are that a diacritic over a consonant means it's followed by a vowel. Each diacritic means a different vowel:

"^" - a
"´" - e
"·" - i
"°" - o
"¨" - u

Although they're arbitrary, I tried to make them related to their meanings:

I added one other for nasal diphthongs (in Portuguese, they are "-ão", "-ãe" and "-ãos"), which is represented by "~" (obvious, right?). I'm also using one character from the Cyrillic alphabet to represent roughly the same phoneme: щ.

Actually, it's not as simple as that. There are two ways to write this phoneme /ʃ/, like in this phrase, "Xícara de chá" (cup of tea): both "x" and "ch" represent the same phoneme, but when I'm writing, I keep "x" as "x" and replace "ch" by "щ". Replacing "ch" by "x" feels... wrong.
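Even though the system is meant for paper, the basic rule can be sketched in Python using Unicode combining marks. This is only a rough illustration under several assumptions: the combining characters below are my guess at the closest Unicode equivalents of the five diacritics, accented vowels and nasal diphthongs are left untouched, a second vowel in a row is kept as a plain letter, and compress and base_length are made-up names.

```python
import unicodedata

# Vowel -> combining mark placed over the preceding consonant (assumed equivalents).
COMBINING = {
    "a": "\u0302",  # ^ circumflex
    "e": "\u0301",  # ´ acute
    "i": "\u0307",  # · dot above
    "o": "\u030a",  # ° ring above
    "u": "\u0308",  # ¨ diaeresis
}
CONSONANTS = set("bcdfghjklmnpqrstvwxyzщ")


def compress(text):
    """Replace 'ch' by щ and fold each plain vowel into the consonant before it."""
    text = text.lower().replace("ch", "щ")
    out = []
    i = 0
    while i < len(text):
        current = text[i]
        following = text[i + 1] if i + 1 < len(text) else ""
        if current in CONSONANTS and following in COMBINING:
            out.append(current + COMBINING[following])
            i += 2  # the vowel has been absorbed into the diacritic
        else:
            out.append(current)
            i += 1
    return "".join(out)


def base_length(text):
    """Number of characters that take horizontal space (combining marks excluded)."""
    return sum(1 for ch in text if not unicodedata.combining(ch))


word = "banana"
print(compress(word), base_length(compress(word)), "of", len(word))
```

For a word that alternates consonants and vowels, such as "banana", this halves the number of base characters, which is the best case for the method.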

Some improvements that can be made
Some words in Portuguese start with a vowel, but there's no way to represent these vowels using diacritics, as there's no other character to put them over. I could use a "null consonant", probably a dotless i, and use the diacritics over it. I don't like this solution very much, but I'm still thinking about it.

Another problem is diphthongs and triphthongs. Currently, I've been putting one diacritic over the other, in a recursive fashion: if b+^ means "b followed by an a", then "b+^"+· means "b followed by an a followed by an i". I don't like it, as it easily gets unreadable, so I have to be careful with it.

The problem with this approach is that I have no way to differentiate hiatus from diphthongs. Notice the diacritic over the i: "cai" (he/she falls) and "caí" (I fell). In the last word, the "i" is pronounced a little longer. A quick hack to solve this would be to represent diphthongs as they are now, and use the null consonant to represent hiatus, so it would be:

The last problem I can think of for now is that some common digraphs could have their own symbols. Almost all consonants can be followed by an r in Portuguese, so I could think of a way to represent this cluster. Another common case is the nasalization of a vowel by putting an n or an m after it. I can't use a simple ~, as all vowels can be made nasal, so a tilde alone would be ambiguous. There are other cases such as ch (see above), nh and lh.

Drawbacks

It's only efficient on paper. I can't write the same way using a keyboard (although my keyboard settings accept weird combinations such as ś, not all consonants can have a diacritic) and even if I could, it wouldn't help, as I'd have to press basically the same number of keys.
It's rather simple, so I guess there wouldn't be a big problem teaching someone to read it. At the same time, it's quite similar to the "normal" way of writing, so it can't be used for cryptic writings.

Conclusion
As I said, I'm not completely satisfied with it, but considering the idea was created and developed in a few hours, I like it so far. I haven't put it through heavy tests yet, so I can't say how efficient this really is. Some quick tests with phrases that came out of my head showed that I write something between 60% and 70% of the original number of characters. Of course the improvement is not that big, as I have to write diacritics now, but I already had to before (as I exemplified with "cai" and "caí"), and considering the vowel i, there's always this dot, so I just had to change its place.

First!

Hi... so this is my first post. I'll write about things I like, which are mainly programming and linguistics. This post will be like my traversing of the Bridge of Death.

An old man appears...
- You shall answer me three questions before you may pass! What's "alpha rho sigma"?
These are my initials if you write them using their Greek equivalents.

- What is 0x3a29?
It's a number, d'oh! Actually, it's the ASCII code for ":)".

- Why start this blog?
I already have a blog, but it's in Portuguese. There are some problems with it, though: my more technical posts aren't read, and lots of my posts are only read in certain months (when people are "doing" their homework about the subject). Besides, I'm not part of the audience of my blog; I believe that almost all programmers look for help in English when they need something more specific, myself included, so I saw no reason to write about advanced topics in Portuguese.

Another reason is that I recently created an account on Identi.ca and on Twitter and I need somewhere to write longer texts. Feel free to follow me, though I suggest you use Identi.ca instead of Twitter.