Feb09

Language Implementation Patterns

programming languages antlr | comments

One thing I am constantly thinking about is programming languages: implementing them, designing them, using them. The latter two are just fun, but implementing languages is a tricky business, and can be difficult and time-consuming.

There are some tools for helping you implement a language, such as the classical lex and yacc lexical analyzer and parser generator, respectively, or the more modern ANTLR, a combination of both.

However, learning how to even use these tools, or when to, is also difficult. Enter Language Implementation Patterns. This book, by Terence Parr (inventor of ANTLR, no less), attempts, quite successfully, to introduce the common patterns that you will use in implementing a language, with or without a parser-generator like ANTLR.

Most of all, I say that it is about freaking time for a book like this to come out. Compiler books have been mostly stagnating these past few years, and while this doesn't cover some of the more advanced optimization techniques and the like, it does a thorough job of showing how modern compilers are built. Bravo, Terence.

Dec17

Responsible administration for you and your users

certificates servers administration | comments

When I was constructing my own websites, for myself or for clients, I was always bothered by one thing:

I didn't have an SSL certificate for any of my domains.

This is due to many reasons: SSL certificates are tied to a single domain, they cost a lot of money, and they are tied to an IP address, which can make shared hosting hard. Furthermore, I don't feel that shelling out money for some SSL authority to issue me a certificate really means anything – why are they qualified to say anything (good or bad) about my website.

Unfortunately, there are only two alternatives: don't use SSL, or use a self-signed SSL certificate.

Not using SSL is not really an option if you are sending sensitive information of any kind over the Internet. For example, the password to login to the backend of any site.

If you have a user-facing website that is selling things, i.e., asking people for their credit card information, you don't have much of a choice. You really need to use SSL, and you have to use a real registered SSL certificate, since you don't want to scare away users.

However, for almost any other case, I think that they are just not necessary. Take this website, for example: the only person who logs into this domain is me. I certainly trust any certificate I sign, so a self-signed certificate makes a lot of sense. I can go ahead and "Accept the risk", as Firefox asks me to, and confirm an exception for my own site. I will then sleep easy knowing that my password is sent encrypted.

Even if I make a site for a client, I would be happy to give them the option: we can use this self-signed certificate for you to login to the backend, or they can front the cost for a "real" SSL certificate. However, nobody likes to waste money, and buying a certificate for only a few people to use a backend is a silly idea.

Creating a self-signed certificate with OpenSSL is easy, and installing it in Apache is also easy.

Don't let you or your clients down by neglecting to give them the option for free SSL.

Extra credit question: why aren't there are any free SSL signing authorities out there?

Sep26

Assembly language programming under OS X with NASM

nasm mac osx assembly programming | comments

One of my favorite passions from my teenage days was assembly language programming. Don't laugh.

It embodies a lot of my favorite things about programming: I have total control, it is clean and simple, and it is just and fast and functional as I am capable of making it. The only thing standing in the way of me and world domination is how well I can program.

Well, with some minor exceptions. Having a decent assembler is really key, and for x86 architectures, there are many choices. If you are hacking a small function together to support some higher-level language, then maybe you can get by using MASM (for Windows/DOS), or, if you really hate yourself, gas (for every platform).

But, if you intend to spend some time programming in assembly and not hating every minute of it, then you need to use NASM, probably the best assembler for the x86 family, ever (or possibly Yasm, a NASM clone that I have no experience with).

So, this past week I was interested in playing some more with NASM, and so I thought that I would see how what I could do under OS X (I previously worked primarily in DOS). Unfortunately, assembly language support in OS X is fairly hampered if you want follow its standard calling conventions for 32-bit x86 code. The innocuous-looking statement "The stack is 16-byte aligned at the point of function calls" seems innocent, but is a nightmare if you using external calls.

Basically, this means that you have to keep very close track of your stack size when calling functions. And even worse is that your stack never enters your function correctly aligned: the return address is always 4 bytes long, meaning you are always 12 bytes off when you start.

What this means is that I may have to use a VM with Linux to have any fun with assembly language programming again.

If anyone is interested, here is a quick NASM file I threw together that demonstrates how to use NASM on OS X to call glibc functions. I tested it with the latest NASM (2.07) on Snow Leopard (you'll probably need XCode installed to get this to work). This program prints "Hello World", allocates some memory using malloc, uses that memory to write 10 letters of the alphabet on the screen (using printf), frees the memory, and returns.
;
; Basic OS X calls to glibc
;
; compile with:
; nasm -g -f macho malloc.asm
; gcc -o a.out malloc.o
;
 
; glibc stuff
extern _puts, _printf, _malloc, _free
 
; static data
segment .data
 
hello_world_str db "Hello world!", 10, 0
int_str db "Address %x", 10, 0
 
; code
segment .text
 
global _main
 
_main:
        push ebp ; setup the frame
        mov  ebp, esp
 
        sub  esp, 4 ; align the stack
        push dword hello_world_str
        call _puts
        add  esp, 4
 
        ; malloc 16 bytes
        push  dword 16
        call  _malloc
        add  esp, 4
 
        ; check if the malloc failed
        test  eax, eax
        jz    fail_exit
         
        sub   esp, 0xC ; align the stack
        mov   ebx, eax
        push  ebx
        push  dword int_str
        call  _printf
        add   esp, 8
        add   esp, 0xC
 
        ; print "A\nB\n..."     
        mov   [ebx], dword 0xD41 ; 'A\n'
 
        mov   edi, 10
        push  ebx
_loop:
        call  _puts
        inc  dword [ebx] 
        dec  edi 
        jnz  _loop
 
        add  esp, 4
         
        ; free the malloc'd memory
        push  ebx
        call  _free
        add  esp, 4
        add  esp, 4 ; cleanup the stack 
        pop  ebp
        ret
         
fail_exit:
        mov  eax, 1
        pop  ebp
        ret


The output should look something like this:
Hello world!
 
Address 100130
A
B
C
D
E
F
G
H
I
J


Am I the only one who likes assembly language programming these days?

Aug23

Gentoo Linux FTW?

linux gentoo funny | comments

For the past 6 years or so, Gentoo has been my preferred distribution of Linux on the server side* for many reasons.

But, that doesn't stop me from laughing uncontrollably at Gentoo's Uncyclopedia entry. Pull quote from the top:

“Stop staring at my output; you have no life!”
  ~ GCC on Gentoo users


I have to admit — Gentoo is Linux at its most pretentious and obnoxious. Maybe I'm an elitist, but I still use and love it.

Why? What keeps me emerge-ing year after year?

  • I know Gentoo. I've been working with it for years, and when things go wrong, I know where to look and how to fix it. This is also partially due to the fact that it has (arguable, and in my opinion) one of the simplest overall system designs. I don't need a GUI to fix it, which is really handy for my remote installs where I won't even install a GUI.
  • It's fast. The rumors are true — if you know what you are doing and you take some time to do it, you can build a very fast system.
  • Upgradeable. The whole system can be easily brought up to speed by resyncing and emerging world. The only thing you ever have to reboot for is a kernel change, and you'll never have to reinstall to upgrade "versions". Gentoo doesn't have versions.
  • Up/downgradeable. If I want a custom-built version of, say, apache built, I can easily build such a thing and install it. If it is completely broken, I can easily uninstall it. And, 99% of the time, I won't ever have to open up the source and run ./configure --prefix=/opt/apache2 && make && sudo make install to do it, and I won't have to just hope that it won't clobber my files.


The first point is one primary reason I don't switch: I am locked in. I would maybe try Debian or Ubuntu for a server setup, but I'd be shaking in my boots waiting for something to break and not knowing where to start fixing the problem.

My biggest complaint about Gentoo? It doesn't deal well with bleeding edge Ruby gems. Meaning, Gentoo tries to install Ruby packages like Gentoo packages, but Gentoo packages are not always up-to-date or complete for all of the gems that are out there. So, there tend to be version problems and weird configuration errors if you start maintaining your own gem install somewhere. I'm not really sure whose fault this problem is.

Aside from this minor fault and the reputation that Gentoo gets for being elistist and complicated, I will be sticking to strange, bloated Pacman for the foreseeable future.





* My preferred distribution for desktop? Mac OS X. :) Sorry Linux, you just don't feel as smooth and silky in my hands with a GUI. The slick GUI that OS X gives me beats any minor incompatibility problems I have, or the fact that OS X uses strange broken BSD versions of some command-line tools, like sed.

Aug14

Brilliant programming

programming sorting searching | comments

Often, I like to read articles on particularly insightful programming techniques or algorithms. Sometimes these algorithms are for some fairly mundane things, like searching and sorting. (So mundane, of course, that Knuth wrote a whole book about it.)

Two pieces I have enjoyed recently.

First, timsort – the sorting algorithm used by Python. It describes an interesting approach to mergesort that is stable and incredibly fast, especially for "almost" sorted arrays.

Second is a class: strlen(), from glibc. strlen(), for you non-C programmers, finds the length of a null-terminated string. So, simply, given a location in memory, it finds the next zero byte. Sounds easy? Well, it is, but there are several tricks you can to do speed up the search by quite a bit.

Any other favorite snippets of code out there?

Jul13

Python Remote Debugger Announcement

python debugging remote cherrypy | comments

Recently, I desired to have a simple Python function I could call with as little fuss as possible that would start up some kind of server that I could use to tell what is going on inside my program at a later date, without resorting to some kind of logging system or console output.

Behold, a simple Python Remote Debugger.

Well, it's not really a debugger (yet). It's more of a "Current State of the System".

There are two servers shipped with it: a plain, pickle-based server (meant to be used to develop richer applications, though it may be abandoned in the future), and a simple web server, based on CherryPy.

It's still in beta, but has many neat features, including:
  • Listing of all currently running threads. Each entry includes a snapshot of the values of all of their local and global variables, as well as the current stack trace.
  • Listing of all loaded modules
  • sys.path listing
  • SSL support
  • Simple username / password authentication


I would be more than happy to take feature requests. Though, the source is available, fairly simple (and hopefully easy to use), and licensed under the MIT license, so you are free to use or modify it as you see fit.

If I have time in the future, I would love to develop some kind of fancy Ajax-based actual remote debugger, with IPython-like functionality, but it probably won't happen for a while.

Jun13

Bypassing the Python GIL with ctypes

python programming gil ctypes c | comments

I recently read an interesting article (actually, the slides linked to) about the horror that is the Global Interpreter Lock in Python, especially with multicore CPUs. And I agree — in these cases, the GIL is painful.

My favorite way of bypassing the GIL is to use ctypes, a wonderful library that allows you to easily link to dynamic libraries and call the functions from C, with only a small amount of boilerplate (to map function calls, argument types, and return types).

The best feature of ctypes is that when a program is executing a ctypes function, it releases the GIL. Meaning that if you have more than one thread threads, and one of them is busy with a ctypes call, the other threads can go along their merry way.

In the slides above, he shows that Python CPU-intensive multithreaded applications slow down as the number of CPUs increase. Well, I decided to use a quick counterexample.

First, I create a C file to do some work for me, called test.c:
int test(int from, int to)
{
  int i;
  int s = 0;
 
  for (i = from; i < to; i++)
    if (i % 3 == 0)
      s += i;
 
  return s;
}
To compile this as a dynamic shared library under OS X, the following two commands can be used:
gcc -g -fPIC -c -o test.o test.c
ld -dylib -o libtest.dylib test.o
(Under Linux, replace this last line with ld -shared -o libtest.so test.o)

Then, we can use the following Python program to load the dynamic library and run a quick test (should work in Linux or OS X):
import ctypes
import ctypes.util
import threading
import time
 
testname = ctypes.util.find_library('test')
testlib = ctypes.cdll.LoadLibrary(testname)
 
test = testlib.test
test.argtypes = [ctypes.c_int, ctypes.c_int]
 
def t():
  test(0, 1000000000)
 
if __name__ == '__main__':
  start_time = time.time()
  t()
  t()
  print "Sequential run time: %.2f seconds" % (time.time() - start_time)
 
  start_time = time.time()
  t1 = threading.Thread(target=t)
  t2 = threading.Thread(target=t)
  t1.start()
  t2.start()
  t1.join()
  t2.join()
  print "Parallel run time: %.2f seconds" % (time.time() - start_time)
On my quad-core OS X box, I get the following output:
Sequential run time: 13.27 seconds
Parallel run time: 6.66 seconds
A pretty solid doubling of performance, which is what we would hope.

May17

Presentations now up on slideshare

presentations | comments

Someone recently requested that I post some of my older presentations on miscellaneous topics.

When I was more actively involved with teaching and research at The University of Tulsa, I would often give presentations on random things. Programming, research I was doing, my anal retentiveness, whatever. We often had weekly meetings for most of my graduate program, so I would give talks there, or at ACM meetings.

I have been tracking some of these down and posting them to my slideshare account, although I will maintain a list of them (as well as descriptions) here on this site (you can always see them on the slide bar.

As a sample, here is one that is based on some (abandoned) research I did on applying genetic programming to chess.

Apr25

Netbook

netbook wireless drivers linux ubuntu | comments

I finally convinced myself to buy a netbook recently. I am currently typing this post on it.

I got an Asus EeePC 1000HA – that is, it's got a 10" screen, 160 GB hard drive (regular, not solid state), and a 6-cell battery (4.5 hours or so battery life under real-world conditions, I estimate). I'm pretty thrilled about it: typing is fine (though the right shift key has some adjustment time), the screen is nice and crisp, and it is very snappy under Windows XP and in Linux. The 1.6 GHz Atom CPU is just enough to play Netflix Instant videos, which was the only thing I was worried about performance-wise.

My biggest complaint was the wireless performance in various types of Ubuntu Linux (the newest 9.04 Netbook Remix is by far my favourite). I'd see 500+ ms ping time to my router, and drop about half of my packets, etc. It was miserable, and pretty much unusable. It turns out that the open source Atheros driver doesn't support this particular chipset very well, and the proprietary madwifi-based driver from Atheros doesn't support it at all. The only decent solution was to use the ndiswrapper package with Asus's Windows driver, which makes me feel a bit dirty.

But regardless, with all that sorted out, I won't be booting up Windows much more.

Overall, I'm impressed and very happy with my purchase.

Apr19

Responsible Development Practices: Storing Sensitive Information

development passwords practices tiddlywiki | comments

I build and maintain web sites. I manage probably hundreds of online accounts or logins or passwords that I need access to semi-regularly. Organizing all of this information can be a bit of a pain.

I've heard people swear by spreadsheets or plain text files for storing this information, and there are some specialized programs out there meant to catalog and organize this for you.

My solution is one of the simpler: TiddlyWiki. It's a small (few hundred K), self-contained (one HTML file), free file you can carry with you to store lots of information. I simply create some big list tiddlers (what they call their wiki pages), each one of which links to a specific tiddler for one particular task. So, I have a list for all of the web sites I maintain, and the tiddler for each has MySQL information, passwords, and any other crucial information.

This is great, but there are two problems. One, I am putting all of critical info in one place. To fix this, I maintain copies of this wiki file in several places, such as my thumb drive, laptop, and DropBox.

The second catch is then, what if someone finds this file? Wouldn't they have access to everything? This is why encryption is important. Whenever the file needs to be transported somewhere (like, on my thumb drive or put into DropBox), I first encrypt it. GPG is nice for this, but in a pinch, you can just use the ever-present OpenSSL to encrypt a file with something like:
openssl enc -aes-256-cbc -salt -in file.html -out file.html.encrypted


Now, as long as I remember this password, I can safely pass this encrypted file around without any worries about losing my thumb drive in the airport.

For most of the actual machines I log into via SSH, I simply use public key authentication, that way I don't have to remember the passwords.