Tuesday, March 1, 2011

cudaram - a block device exposing NVIDIA GPUs' RAM implemented with CUDA

I have been looking for a silly learning project for the Linux kernel for some time and I think I have finally come up with something suitable.
Why not use the extra free RAM on your GPU for something useful while not hindering your normal GPU use (vdpau, desktop effects etc.)? I started looking and didn't really find anything of the sort. The closest was an entry in the Gentoo Wiki - Use memory on video card as swap - but that approach forces you to use the vesa driver, and you can't map the whole GPU RAM like that, at least not on my GTX 260.

I came to the conclusion that the most generic way would be to expose the extra resources via a block device.

Next up was actually figuring out how to do that. I immediately thought of CUDA, as I have had some contact with it before and knew you can easily manage GPU RAM with it. It's also possible to use it without disrupting the normal chores of your GPU - like actually displaying something on your monitor.

Sounds perfect, right? The only problem is that both the CUDA toolkit and the NVIDIA drivers are closed-source and their interactions aren't documented anywhere. The only API they provide is in userspace and hence accessing it from the kernel isn't easily doable. One could try to reverse-engineer the internal API, but I didn't want to go there with my first project, especially as both the toolkit and the drivers are constantly evolving and surely changing the API along the way.
I ended up deciding to be nice and use the CUDA userspace API. It complicates the design, but that actually might be a plus given that it's supposed to be a learning project. The final design follows:

cudaram kernel module <-> cudaramd userspace daemon <-> CUDA toolkit <-> nvidia kernel module

Basically, it is a block device with its storage implemented in userspace. There are similar things out there - like NBD and ABUSE. There is also FUSE, but that's at a different level.
I decided to write my own module for two reasons: firstly, I wanted to learn as much as possible, and secondly, it gives me the most flexibility should I need it later.

And so I did. I have pushed the code to https://github.com/peper/cudaram. There is a basic README included in the repo too if you are brave enough to try it out :) I wouldn't necessarily recommend that as at this point I would like to mostly gather feedback on my implementation.

Nevertheless it does seem to work and it's pretty fast at least for some loads:
$ mkfs.ext2 /dev/cudaram0
...
$ mount /dev/cudaram0 /mnt/cuda

# /mnt/tmpfs/foo is a 250MB file in tmpfs

# copy from tmpfs to cudaram
$ dd if=/mnt/tmpfs/foo of=/mnt/cuda/foo bs=$((1000*1000)) count=250 conv=fdatasync
250000000 bytes (250 MB) copied, 0.296378 s, 844 MB/s

# copy from cudaram to tmpfs
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/mnt/cuda/foo of=/mnt/tmpfs/foo bs=$((1000*1000)) count=250 conv=fdatasync
250000000 bytes (250 MB) copied, 0.275168 s, 909 MB/s

# copy from tmpfs to tmpfs
$ dd if=/mnt/tmpfs/foo of=/mnt/tmpfs/foo2 bs=$((1000*1000)) count=250 conv=fdatasync
250000000 bytes (250 MB) copied, 0.13663 s, 1.8 GB/s
So cudaram is about 2 times slower than tmpfs at copying one big file. Doesn't seem too bad at all for the first version. What helps it here is that in this load it's getting pretty big I/O requests. Where it might hurt is a lot of small requests - that should be obvious after reading the following overview of the implementation.
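
The "about 2 times slower" figure can be rechecked from the dd numbers above. A minimal sketch, using only the byte count and timings printed by dd; the helper name `throughput_mb_s` is made up for this example:

```python
# Size and timings copied from the dd output above.
SIZE_BYTES = 250_000_000

timings_s = {
    "tmpfs -> cudaram": 0.296378,
    "cudaram -> tmpfs": 0.275168,
    "tmpfs -> tmpfs": 0.13663,
}

def throughput_mb_s(size_bytes, seconds):
    """Throughput in decimal megabytes per second, the unit dd reports."""
    return size_bytes / seconds / 1_000_000

baseline = timings_s["tmpfs -> tmpfs"]
for name, seconds in timings_s.items():
    rate = throughput_mb_s(SIZE_BYTES, seconds)
    # The ratio compares each copy's wall time with the tmpfs-only copy.
    print(f"{name}: {rate:.0f} MB/s, {seconds / baseline:.2f}x the tmpfs copy time")
```

Both cudaram copies come out at a bit over twice the tmpfs-only time, which is where the rough factor of 2 comes from.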

Currently the cudaram module creates a few cudaramX block devices with matching cudaramctlX control devices. The cudaramd daemon allocates the GPU RAM and a transfer buffer and starts communicating with the cudaram module via ioctl()s on the control device.

After initialization the flow is as follows:
  • ioctl() call start
  • Submit the last completed I/O request
  • If the I/O request direction was READ then the module copies the data from the transfer buffer
  • The module marks the request as completed
  • Sleep waiting for more requests
  • If there are pending I/O requests, take the first one from the queue
  • If the I/O request direction is WRITE then copy the data to the transfer buffer
  • Return the data required to complete the request (sector number etc.)
  • ioctl() call end
  • Perform the request, i.e. copy data between the GPU RAM and the transfer buffer
  • Start over
The I/O requests are queued asynchronously by the block device subsystem.
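
To make the flow above easier to follow, here is a toy simulation of it in plain Python. It is only a sketch of the handshake, not the real code: `FakeModule`, `daemon_loop`, the request dictionaries and the 512-byte sector size are all assumptions made for this example, the "GPU RAM" is just a bytearray, and where the real module would sleep waiting for requests the sketch simply returns.

```python
from collections import deque

SECTOR_SIZE = 512  # assumed sector size for this sketch


class FakeModule:
    """Stands in for the kernel module: owns the request queue."""

    def __init__(self):
        self.queue = deque()
        self.in_flight = None

    def submit(self, direction, sector, data=None, nsectors=0):
        request = {"dir": direction, "sector": sector, "data": data,
                   "nsectors": nsectors, "result": None, "done": False}
        self.queue.append(request)
        return request

    def ioctl(self, buf):
        """One control call: complete the previous request, hand out the next."""
        prev = self.in_flight
        if prev is not None:
            if prev["dir"] == "READ":
                # READ: the daemon left the data in the transfer buffer.
                prev["result"] = bytes(buf[:prev["nsectors"] * SECTOR_SIZE])
            prev["done"] = True
            self.in_flight = None
        if not self.queue:
            return None  # the real module would sleep here instead
        req = self.queue.popleft()
        if req["dir"] == "WRITE":
            # WRITE: copy the payload into the transfer buffer for the daemon.
            buf[:len(req["data"])] = req["data"]
            req["nsectors"] = len(req["data"]) // SECTOR_SIZE
        self.in_flight = req
        return req["dir"], req["sector"], req["nsectors"]


def daemon_loop(module, gpu_ram, buf):
    """Stands in for cudaramd: moves data between 'GPU RAM' and the buffer."""
    while (job := module.ioctl(buf)) is not None:
        direction, sector, nsectors = job
        start, length = sector * SECTOR_SIZE, nsectors * SECTOR_SIZE
        if direction == "WRITE":
            gpu_ram[start:start + length] = buf[:length]
        else:
            buf[:length] = gpu_ram[start:start + length]


module = FakeModule()
gpu_ram = bytearray(16 * SECTOR_SIZE)   # pretend GPU allocation
buf = bytearray(4 * SECTOR_SIZE)        # transfer buffer
module.submit("WRITE", sector=2, data=b"x" * SECTOR_SIZE)
read_req = module.submit("READ", sector=2, nsectors=1)
daemon_loop(module, gpu_ram, buf)
```

Note how every request costs a full round trip through the control call, which is why many small requests are expected to hurt more than a few big ones.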

For any more details I will have to redirect you to the source code.

TODO:
  • Figure out whether making swap to cudaram work is possible - currently it can deadlock! Might be especially tricky given that the nvidia driver is closed-source
  • Allocating GPU RAM in smaller chunks - to avoid fragmentation problems
  • Allocating GPU RAM on demand
  • Test different userspace-kernel communication schemes - e.g. vmapping the userspace buffer, adding separate read/write buffers, etc.
  • Make it more user-friendly
That's it for now. I would really appreciate all kinds of feedback, but especially a code review from kernel hackers :)

Saturday, July 10, 2010

Paludis 0.48.2 Released

I don't normally post about new Paludis releases, but this one brings a new client that many have been asking about.

From NEWS:

  • The ‘cave’ client is now enabled by default. ‘cave’ is a modular console client that will eventually replace ‘paludis’. It is currently reasonably functional and well tested, but does not yet have all the features present in ‘paludis’, and is thus not yet considered a complete replacement.


See this post for more on cave.

Tuesday, April 28, 2009

window.name hack taken a step further: full XHR proxying

My current GWT project requires a secure cross-site rpc solution so I started digging:

I liked the window.name hack the most and started implementing it in GWT. And while doing so I asked myself a question - why is it only limited to form submission? Well, it's not! You can use window.name for communication like in the #hash communication and do full XHR proxying. Look at this new (at least I didn't see it anywhere else) cross-site communication scheme:
  • Create an iframe
  • Encode XHR params and a dummy localUrl in the iframe's window.name
  • Change the iframe's location to the server's proxy script (i.e. if you want to send a request to example.org, example.org needs to provide a proxy script at e.g. example.org/cross_site_proxy.html)
  • The proxy script reads params from window.name and creates the real XHR
  • Fire the XHR and encode the response (all of it) in window.name
  • Change the location back to localUrl
  • Read the response from the iframe's window.name
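
The data flow of the steps above can be sketched outside the browser. This is only a model in Python, not the GWT implementation: JSON as the encoding, the field names (`method`, `url`, `body`, `localUrl`, `status`, `responseText`) and the `fake_xhr` stand-in are all assumptions made for the example, and a plain string plays the role of window.name.

```python
import json


def encode_request(method, url, body, local_url):
    """Pack the XHR parameters and the return URL into one string."""
    return json.dumps({"method": method, "url": url,
                       "body": body, "localUrl": local_url})


def proxy_script(window_name, perform_xhr):
    """What the server's proxy page would do with window.name."""
    params = json.loads(window_name)
    status, text = perform_xhr(params["method"], params["url"], params["body"])
    # The whole response goes back through window.name.
    response_name = json.dumps({"status": status, "responseText": text})
    return response_name, params["localUrl"]


def fake_xhr(method, url, body):
    # Stand-in for the real XMLHttpRequest made from the proxy's origin.
    return 200, f"{method} {url} got {len(body)} bytes"


# Client side: encode the request, "navigate" to the proxy, then read back.
name = encode_request("POST", "/rpc", "ping", "/blank.html")
name, next_location = proxy_script(name, fake_xhr)
response = json.loads(name)
```

After the last step the client has the full response and knows the iframe is back at `localUrl`, so it is allowed to read window.name again.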

It is important to set proper caching headers for both localUrl and server's proxy script so that they will be loaded from browser's cache w/o any additional requests.

Pros:
  • As secure as Fragment Identifiers XHR proxying and the original window.name hack
  • Full XHR proxying like in the Fragment Identifiers XHR proxying
  • No server changes needed other than providing the proxy script

I am currently finishing my proof of concept implementation in GWT and I will do a follow-up on it shortly. In the meantime it can probably be easily implemented in js libraries like dojo as they have most of the required bits already done.

What do you think?

Tuesday, April 7, 2009

paludis.org

Just a quick note: I have taken over paludis.org and it finally points to where it should!

Sunday, March 22, 2009

Benchmark Paludis 0.38 and Portage 2.1.6.9 #2

As it seems unclear to some, I have redone my benchmarks today with extra comments.

peper is using different configuration files for portage and paludis. So the comparison is quite ... uhm ... difficult.
I did include -E portage timings too, didn't I? But let's look again.
# time paludis -ip sys-apps/portage -E portage
real    1m39.186s
user    1m2.442s
sys     0m38.543s
<bonsaikitten> dleverton: since I don't have any paludis config files my results are what should happen, -E seems to not do things the same (or there are some artifacts from the previous config left)
Still not enough? Ok. (Yes, I know these are silly.)
# mv /etc/paludis /etc/paludis.hidden
# time paludis -ip sys-apps/portage
real    1m39.407s
user    1m2.099s
sys     0m38.633s
# time paludis -ip sys-apps/portage -E portage
real    1m39.948s
user    1m2.133s
sys     0m38.873s

And portage didn't get any faster since yesterday either:
# time emerge -puD sys-apps/portage
real    5m30.694s
user    3m30.119s
sys     1m56.093s

What am I missing this time?

And to get things back to normal:
# mv /var/lib/gentoo/repositories/gentoo/metadata.hidden /var/lib/gentoo/repositories/gentoo/metadata
# time emerge --metadata
real    2m58.175s
user    0m16.025s
sys     0m8.808s

# time paludis --metadata
Usage error: Error handling command line: Bad argument '--metadata'

real    0m0.022s
user    0m0.007s
sys     0m0.004s
Oh, right, paludis doesn't do that.
And now:
# time emerge -puD sys-apps/portage
real    0m4.196s
user    0m3.924s
sys     0m0.177s

# time paludis -ip sys-apps/portage -E portage
real    0m3.082s
user    0m2.554s
sys     0m0.518s

My /etc/make.conf for the record:
CFLAGS="-march=athlon64 -O2 -pipe"
CXXFLAGS="$CFLAGS"
CHOST="x86_64-pc-linux-gnu"
USE="mmx sse sse2 -gnome hal -cups bash-completion -ldap vim-syntax laptop"

PORTDIR="/var/lib/gentoo/repositories/gentoo"

MAKEOPTS="-j2"

FEATURES="distcc"

KBUILD_OUTPUT=/usr/src/build-current
VIDEO_CARDS="nvidia"
INPUT="evdev"
DISTDIR="/home/data/distfiles"

ACCEPT_KEYWORDS=~amd64

P.S. What do you call having comments blocked completely if it's "braindamaged" to allow comments from multiple popular services?

Saturday, March 21, 2009

Benchmark Paludis 0.38 and Portage 2.1.6.9

I didn't quite believe the benchmark done by bonsaikitten so I did my own and wasn't able to reproduce the results.

Env (same as bonsai used)
  • Portage 2.1.6.9
  • Paludis 0.36.0
  • hot kernel cache
  • no metadata cache in gentoo repo
  • no package manager on-disk cache (nuked /var/cache/edb/dep and /var/cache/paludis/metadata/gentoo)

# time emerge -puD sys-apps/portage
real    5m32.896s
user    3m36.263s
sys     1m53.655s

# time paludis -ip sys-apps/portage
real    1m31.509s
user    1m0.432s
sys     0m32.432s

# time paludis -ip sys-apps/portage -E portage (bonsai doesn't have a paludis config)
real    1m36.575s
user    1m2.969s
sys     0m36.501s

And with cold kernel cache (echo 3 > /proc/sys/vm/drop_caches)
# time emerge -puD sys-apps/portage
real    6m25.499s
user    3m34.228s
sys     1m53.624s

# time paludis -ip sys-apps/portage
real    2m38.976s
user    1m3.312s
sys     0m38.318s

So reading the necessary ebuilds (and friends) takes roughly a minute on my box.
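
The "roughly a minute" comes from subtracting the hot-cache wall time from the cold-cache one. A small sketch of that arithmetic, using only the 'real' values printed above; the helper `to_seconds` is made up for this example:

```python
import re


def to_seconds(stamp):
    """Convert a `time` value such as '2m38.976s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


# 'real' times copied from the runs above: (hot cache, cold cache).
real = {
    "emerge -puD": ("5m32.896s", "6m25.499s"),
    "paludis -ip": ("1m31.509s", "2m38.976s"),
}

for tool, (hot, cold) in real.items():
    extra = to_seconds(cold) - to_seconds(hot)
    print(f"{tool}: cold cache adds {extra:.1f} s")
```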

A little curiosity as a bonus:
# strace -e file paludis -ip sys-apps/portage 2>&1 | wc -l
14219
# strace -e file emerge -puD sys-apps/portage 2>&1 | wc -l
32240

Interesting difference, let's look closer at what portage does:
access("/var/lib/gentoo/repositories/gentoo/sys-devel/autoconf/autoconf-2.63.ebuild",
stat("/var/lib/gentoo/repositories/gentoo/sys-devel/autoconf/autoconf-2.63.ebuild",
open("/var/cache/edb/dep/var/lib/gentoo/repositories/gentoo/sys-devel/autoconf-2.63",
stat("/var/lib/gentoo/repositories/gentoo/sys-devel/autoconf/autoconf-2.63.ebuild",
getcwd("/etc"...,
lstat("/var",
lstat("/var/lib",
lstat("/var/lib/gentoo",
lstat("/var/lib/gentoo/repositories",
lstat("/var/lib/gentoo/repositories/gentoo",
lstat("/home",
lstat("/home/data",
lstat("/home/data/distfiles",
lstat("/usr",
lstat("/usr/portage",
lstat("/usr/portage/rpm",
lstat("/var",
lstat("/var/tmp",
stat("/var/tmp/portage/sys-devel/autoconf-2.63/temp/environment",
stat("/usr/bin/sandbox",
access("/usr/bin/sandbox",
And the same for each ebuild read from the repo. Highlighted lines are interesting...
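
If you want to see which calls dominate such a trace, counting them by name is enough. A minimal sketch, assuming the trace lines start with the syscall name as in the excerpt above; the sample is a shortened copy of that excerpt and `count_syscalls` is a made-up helper:

```python
import re
from collections import Counter

# A few lines in the shape of the strace excerpt above (shortened on purpose).
sample = '''\
access("/var/lib/gentoo/repositories/gentoo/sys-devel/autoconf/autoconf-2.63.ebuild",
stat("/var/lib/gentoo/repositories/gentoo/sys-devel/autoconf/autoconf-2.63.ebuild",
lstat("/var",
lstat("/var/lib",
lstat("/home",
stat("/usr/bin/sandbox",
'''


def count_syscalls(trace):
    """Count how often each syscall name appears at the start of a line."""
    names = (re.match(r"(\w+)\(", line) for line in trace.splitlines())
    return Counter(m.group(1) for m in names if m)


counts = count_syscalls(sample)
for name, n in counts.most_common():
    print(f"{n:3d} {name}")
```

Feeding it the full output of `strace -e file ... 2>&1` instead of the sample gives a per-call breakdown of the totals shown earlier.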

Tuesday, March 3, 2009

Mounting a raw kvm/qemu/... image

I have been doing some kernel hacking recently (mostly uni tasks) and the ability to mount a kvm image turned out to be very useful. All you need is losetup (and loop device support in the kernel, of course) and kpartx. On Gentoo they are in sys-apps/util-linux and sys-fs/multipath-tools respectively.

Create a new image:

# kvm-img create -f raw gentoo.raw.img 4G
# losetup -f
/dev/loop/0
# losetup /dev/loop/0 gentoo.raw.img
# cfdisk /dev/loop/0
# kpartx -av /dev/loop/0
add map 0p1 (253:1): 0 8385867 linear /dev/loop0 63
# mkfs /dev/mapper/0p1
# kpartx -dv /dev/loop/0
del devmap : 0p1
# losetup -d /dev/loop/0

Mount an already formatted image:

# kpartx -av gentoo.raw.img
add map loop0p1 (253:1): 0 8385867 linear /dev/loop0 63
# mount /dev/mapper/loop0p1 /mnt/guests/gentoo-kvm

Note that it's dangerous to have the image mounted like that and use it with kvm/qemu/etc at the same time, so here is the reverse process:

# umount /mnt/guests/gentoo-kvm
# kpartx -dv gentoo.raw.img
del devmap : loop0p1
loop deleted : /dev/loop0

Monday, January 14, 2008

UNINSTALL_PROTECT in Paludis #3

After a fair time of using the hook described in the previous post I realized that it noticeably slows down the uninstalls of packages with lots of files like sys-kernel/gentoo-sources and hence I decided to rewrite it as a Python hook to see how it performs:
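# Path prefixes whose files should survive an uninstall.
UNINSTALL_PROTECT = ["/lib64/modules/"]

def hook_run_unmerger_unlink_file_override(env, hook_env):
    for path in UNINSTALL_PROTECT:
        if hook_env["UNLINK_TARGET"].startswith(path):
            return "skip"
    # No prefix matched: empty output means the default action.
    return ""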
I was expecting a pretty big difference as for Python hooks an embedded interpreter is used and .hook hooks need a separate bash interpreter each time they are run. Nonetheless I was still surprised to see how big the difference really was for sys-kernel/gentoo-sources-2.6.23-rX uninstallation:
.hook hook ~ 8m
python hook ~ 1m 15s
no hook ~ 1m
Python hook is not much slower than no hook at all and hence I didn't care to try and write it as a .so hook, which probably would be even faster.

Friday, August 24, 2007

Python 2.5 and Boost.Python

As Python 2.5 is now unmasked in Gentoo I decided to try it with Boost.Python, more specifically with Python bindings for Paludis. Everything seems to work, but, of course, boost needs to be rebuilt first.

Monday, July 9, 2007

Boost.Python: docstrings in enums

For quite some time I wanted to add docstrings to enums exposed with Boost.Python as my project's API docs seemed incomplete. Let's see how I found the solution eventually:

Enum we want to expose:
// Nice enum!
enum Foo
{
    blah
};

While exposing the enum itself couldn't be any simpler:
BOOST_PYTHON_MODULE(enum_test)
{
    bp::enum_<Foo> enum_foo("Foo");
    enum_foo.value("blah", blah);
    ...
... adding __doc__ is completely another story.

First (silly) attempt:
...
    PyObject_SetAttrString(enum_foo.ptr(), "__doc__",
        PyString_FromString("Nice enum!"));
};
"It must work" I thought, but what I got on import then was far from satisfying:

TypeError: attribute '__doc__' of 'type' objects is not writable

After digging in Boost.Python, Python C API docs and harassing people at #python:
...
    PyTypeObject * pto = reinterpret_cast<PyTypeObject *>(enum_foo.ptr());
    pto->tp_doc = "Nice enum!";
};
This looked really promising... and what? Nothing at all! Changing tp_doc had no effect whatsoever, and only after some more digging did it turn out that it was too late to change it and the right way is:
...
    PyTypeObject * pto = reinterpret_cast<PyTypeObject *>(enum_foo.ptr());
    PyDict_SetItemString(pto->tp_dict, "__doc__",
        PyString_FromString("Nice enum!"));
};
Boost.Python and Python C API are fun ;)

Wednesday, June 20, 2007

Python bindings now in the scm ebuild!

Python bindings are now available in the paludis-scm.ebuild from the Paludis overlay. Just set the python use-flag and reinstall. Comments are welcome, but keep in mind it's far from the final version :]

Additional links: API docs, repository

Saturday, June 16, 2007

Python bindings for Paludis #1

I was supposed to give some updates about my SoC project, so here is the much delayed first one.

I have been working on the bindings since the moment I thought about the project and had some code ready even for the initial project proposal. After it was accepted I continued the work, with longer and shorter breaks. Meanwhile I made some contributions to Paludis in other areas, learning its internals and getting better at C++.

I believe the first big date for my project was 5 April 2007 when bindings code was initially imported to the Paludis repository (r2881). The second big date is still coming, which will be the release of Paludis 0.26 with the Python bindings included.

The exact status of the bindings is best seen in the repository or in the API docs, which I update every now and then.

Currently I am working on bringing *DepSpec up to date and preparing for yet another Paludis API change. The bigger plan is to finish bindings for all the core classes sometime soon...

Friday, May 25, 2007

UNINSTALL_PROTECT in Paludis #2

Update: Check this for a better performing python version of this hook.

I need to make a follow-up again, as the hack described in my previous post is now obsolete: Paludis' trunk now has _override hooks for merger and unmerger actions. They are a little different from all the hooks so far, as their effect is determined by their output. More specifically, unmerger_unlink_*_override has two options:
  • "skip" - skips the action
  • "force" - force it (without the usual tests like type or mtime check)
And merger_install_*_override has only the "skip" option. Also no output means the default.

Let me show an example:

/etc/paludis/hooks/unmerger_unlink_file_override/uninstall_protect.hook:
#!/bin/bash

UNINSTALL_PROTECT="/lib64/modules/"

hook_run_unmerger_unlink_file_override() {
    for PROTECT in ${UNINSTALL_PROTECT}; do
        if [[ "${UNLINK_TARGET}" == "${PROTECT}"* ]]; then
            echo "skip"
        fi
    done
}
Just as easily you can make an INSTALL_MASK hook and others...
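
For illustration, an INSTALL_MASK hook could look roughly like this when written in the Python hook style. Treat it as a sketch under assumptions: the function name simply follows the merger_install_*_override naming described above, the `INSTALL_DESTINATION` key is a guess at how the target path is exposed to the hook, and the masked prefix is a made-up example.

```python
# Hypothetical list of path prefixes that should never be installed.
INSTALL_MASK = ["/usr/share/doc/"]


def hook_run_merger_install_file_override(env, hook_env):
    # ASSUMPTION: the target path is exposed as INSTALL_DESTINATION;
    # check the Paludis hook documentation for the real variable name.
    target = hook_env["INSTALL_DESTINATION"]
    for prefix in INSTALL_MASK:
        if target.startswith(prefix):
            return "skip"
    # Empty output means the default action (install the file).
    return ""
```

Since merger_install_*_override only knows "skip", the hook never needs to return anything else.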

This and more in forthcoming 0.26.

Friday, May 11, 2007

UNINSTALL_PROTECT in Paludis

Update: Check this for the current bash version of this hook or this for a better performing python one.

This is somewhat related to my previous posts about the moduledb, as the idea also arose when playing with kernel modules.

When installing modules for a new kernel I noticed that Paludis treated them like any other package and removed the old installs, leaving my old kernel without the modules. Nothing really surprising here, just different from the portage behaviour. Wanting to avoid this in the future, my first thought was just to add /lib64/modules to CONFIG_PROTECT, but after a second I realised it wasn't the best idea: I didn't want to review modules' changes when upgrading, I just wanted them not to be removed when uninstalling. Hence it wasn't long until the idea of making an Unmerger Hook came to my mind.

/etc/paludis/hooks/uninstall_protect.hook:
#!/bin/bash

UNINSTALL_PROTECT="/lib64/modules/"

hook_run_unmerger_unlink_file_pre() {
    for PROTECT in ${UNINSTALL_PROTECT}; do
        if [[ "${UNLINK_TARGET}" == "${PROTECT}"* ]]; then
            mv "${UNLINK_TARGET}" "${UNLINK_TARGET}.protect"
        fi
    done
}

hook_run_unmerger_unlink_file_post() {
    if [[ -e "${UNLINK_TARGET}.protect" ]]; then
        mv "${UNLINK_TARGET}.protect" "${UNLINK_TARGET}"
        echo "protected '${UNLINK_TARGET}'"
    fi
}

And two symlinks pointing at it:

/etc/paludis/hooks/unmerger_unlink_file_pre/uninstall_protect.hook -> ../uninstall_protect.hook

/etc/paludis/hooks/unmerger_unlink_file_post/uninstall_protect.hook -> ../uninstall_protect.hook

Monday, April 23, 2007

moduledb in Paludis #2

This is only a short follow-up to my previous post.
Support for dynamic configuration files was implemented a few days ago and for a start I have made two simple dynamic sets:

kernel-modules-ver.bash (set of kernel-modules with exact versions):
#!/bin/bash

sed -e 's/.*:/* =/' /var/lib/module-rebuild/moduledb

kernel-modules.bash (set of kernel-modules - unversioned):
#!/bin/bash

shopt -s extglob

while read PKG; do
    PKG=${PKG##*:}
    PKG=${PKG%%-scm*([[:digit:]])}
    PKG=${PKG%%-[[:digit:]]*([^-]|-[^[:digit:]])}
    echo "* ${PKG}"
done < /var/lib/module-rebuild/moduledb
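
The same two transformations, written in Python to show what the parameter expansions do. This is a sketch, not a drop-in replacement: the sample lines are invented (assuming the entries end in ":category/package-version"), and the version pattern is deliberately simpler than the extglob one above.

```python
import re

# Hypothetical moduledb lines; the real file is /var/lib/module-rebuild/moduledb.
SAMPLE = [
    "a:1:x11-drivers/nvidia-drivers-180.29",
    "a:1:app-emulation/kvm-84-r1",
    "a:1:sys-fs/example-module-scm",
]

# Simplified version pattern: '-scm', or '-' followed by a digit and the rest.
VERSION_RE = re.compile(r"-(scm|\d[^/]*)$")


def versioned(line):
    """Like the sed one-liner for lines with a colon: keep what follows the last one."""
    return "* =" + line.rsplit(":", 1)[-1]


def unversioned(line):
    """Drop the version suffix as well, leaving category/package."""
    return "* " + VERSION_RE.sub("", line.rsplit(":", 1)[-1])


for entry in SAMPLE:
    print(versioned(entry), "|", unversioned(entry))
```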

And that's really only the beginning...

Saturday, April 14, 2007

moduledb in Paludis

Update: Check this for the current solution.

With portage I was quite often using module-rebuild, more or less after every kernel update and thus I thought it would be nice to make something similar for Paludis, especially as it seemed really simple with Paludis' Hooks and Sets.

So I made two simple hooks:
And a simple one-liner to import entries from the moduledb:

sed -e 's/.*:/* =/' /var/lib/module-rebuild/moduledb > ${PALUDIS_CONFDIR}/sets/modules.conf

But then I realized that it's not so perfect:
  • Won't work when using Environment != Paludis ( ${PALUDIS_CONFDIR} won't be reliable then )
  • To actually rebuild the modules one has to use: paludis --dl-reinstall always --dl-deps-default discard -i modules and hope that the deps are satisfied :]
  • Doubles the work of the linux-mod.eclass
So I decided to talk with Ciaran about the above issues. While we were talking he implemented a new paludis option (--dl-reinstall-targets auto|always|never), which affects the way paludis treats targets. Up till now the default "auto" was always used (= "never" for sets and "always" for ordinary packages). When it is released the module rebuild will be simply done with:

paludis --dl-reinstall-targets always -i modules

and there won't be any risk of missing deps.

Going further ciaran thought about allowing bash scripts in sets/, in short paludis runs a foo.bash script from sets/ and its output is interpreted as foo set. The moment it is implemented (probably today) and released we can forget about the two hooks above and make a modules.bash similar to the import one-liner.

So don't forget to update Paludis when it's out ;]

My brand new blog and Google Summer of Code

Welcome to my brand new blog!

Thanks to Serendipity setting it up was quite easy :] I've been thinking about making one for quite some time and today I have finally decided to do so. Why today? Well, GSoC!
My proposal was accepted as one of the Gentoo projects. In short, my task is to make python bindings for Paludis, the Other Package Mangler for Gentoo. If you want more details, check out my application and, if you are even more desperate, code, which I have already written. Also ciaran was nice enough to write a few words about it, but don't forget to ignore the rubbish about Python :]
So... I am planning to blog about the progress of my project, some Gentoo stuff, and... time will tell.

P.S. 1) My name is Piotr Jaroszyński (hence the domain), I live in Warsaw and am a first-year Computer Science student at Warsaw University.
P.S. 2) I hope my pathetic 256 kbit/s upload will manage ;]