Multicore scheduling algorithms
The computing power of a modern graphics card is impressive. 128 separate processing cores is a large number, and you will want to share them among several processors. But these cores don't share buses and memory in the same way as general-purpose processors do. So how should you allocate the work to them?
This is not to say that scheduling for general-purpose processors is solved once and for all, but hardly any effort has gone into scheduling for special-purpose processors. These have no caches and only tiny amounts of local memory, the code that they execute is not magically retrieved from main memory, and the bus that you need to get all that data to the processors has, naturally, limited speed.
So you should simulate a few old-fashioned scheduling algorithms, check how they perform for workloads such as H.264 video encoding, and come up with improvements that take the unusual resource constraints of this new hardware into account.
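To give an idea of what such a simulation could look like, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names simulate, fcfs and lpt are hypothetical, a task is reduced to a compute time and an input size, and the hardware is reduced to identical cores behind a single bus that carries one transfer at a time. It ignores local memory limits and code loading, which a real study would have to model.

```python
import heapq
from typing import List, Tuple

# A task is (compute_time, input_size). Both fields are illustrative
# assumptions, not a model of any particular graphics card.
Task = Tuple[float, float]


def simulate(tasks: List[Task], n_cores: int, bus_bandwidth: float) -> float:
    """Return the makespan of list scheduling over a shared, serialized bus.

    Tasks are dispatched in the given order to the core that becomes free
    first. A task's input must cross the bus before it can start computing,
    and the bus carries only one transfer at a time.
    """
    core_free = [0.0] * n_cores  # min-heap of times at which cores go idle
    heapq.heapify(core_free)
    bus_free = 0.0               # time at which the bus is next available
    makespan = 0.0
    for compute_time, input_size in tasks:
        earliest_core = heapq.heappop(core_free)
        transfer_start = max(earliest_core, bus_free)
        bus_free = transfer_start + input_size / bus_bandwidth
        finish = bus_free + compute_time
        heapq.heappush(core_free, finish)
        makespan = max(makespan, finish)
    return makespan


def fcfs(tasks: List[Task]) -> List[Task]:
    """First come, first served: keep the arrival order."""
    return list(tasks)


def lpt(tasks: List[Task]) -> List[Task]:
    """Longest processing time first: sort by compute time, descending."""
    return sorted(tasks, key=lambda t: t[0], reverse=True)


if __name__ == "__main__":
    # Four short tasks followed by one long task, with no input data.
    workload = [(1.0, 0.0)] * 4 + [(4.0, 0.0)]
    print(simulate(fcfs(workload), n_cores=2, bus_bandwidth=1.0))  # 6.0
    print(simulate(lpt(workload), n_cores=2, bus_bandwidth=1.0))   # 4.0

    # Two tasks that each need one time unit on the bus: the second core
    # helps, but the serialized bus keeps the tasks from fully overlapping.
    heavy = [(1.0, 2.0), (1.0, 2.0)]
    print(simulate(heavy, n_cores=1, bus_bandwidth=2.0))  # 4.0
    print(simulate(heavy, n_cores=2, bus_bandwidth=2.0))  # 3.0
```

Even in this toy model the two effects of interest show up: the order in which tasks are dispatched changes the makespan, and the bus limits how much extra cores can help. A workload derived from H.264 encoding would replace the hand-written task lists.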
