If anyone wants to try this without the intricate setup: on a Linux system you can most probably just press Ctrl+Alt+F3 and drop into a tty console directly. To return, press Ctrl+Alt+F1 or Ctrl+Alt+F2. You also have multiple consoles, probably up to F12.
I used to use this a lot when trying for a less distracting desktop, just like in the original post.
I'm a proverbial greybeard and Ctrl+Alt+F7 used to just be what you did to get back to your desktop GUI.
FWIW, right now I'm typing this from Ubuntu Studio 24.04 and it's Ctrl+Alt+F2 to get back to the GUI. Ctrl+Alt+F1 shows you the bootup output scroll, +F3 to +F6 will give you a login prompt to drop into a shell. +F7 to +F12 just give me a blinking cursor in the upper right corner of the display.
I'm kinda surprised only +F3 to +F6 give me a shell login. Four isn't that many.
I think Ctrl+Alt+F1-F7 are kernel-provided virtual console things, and technically they can be connected to different things, I think like VC 5 -> /dev/tty5 -> a thread in /sbin/getty? VC 7 often used to be opened up for X, but Ubuntu moved X to VC 1 at some point. I guess it's VC 2 now.
Maybe this has changed over the years, and I rarely if ever used these combinations to switch to a TTY except in emergencies (OOM, or window manager breakage), but on every Linux system I ever used, graphical mode was on (Ctrl+Alt+)F7.
Using CachyOS right now, gdm/mutter/gnome ended up on Ctrl+Alt+F1. I can't remember if it was crunchbang or some other older distribution, but there have been others using various numbers too. I agree F7 is most common though.
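Since the GUI's VT number clearly varies by distro, a quick way to check which one you are on right now is to read it from sysfs. A minimal sketch, assuming your kernel exposes `/sys/class/tty/tty0/active` (the helper names `active_vt` and `vt_number` are made up for this example):

```python
from pathlib import Path


def active_vt(path="/sys/class/tty/tty0/active"):
    """Return the active virtual console name (e.g. 'tty2'), or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        # Not Linux, no sysfs, or no permission: report "unknown" instead of crashing.
        return None


def vt_number(name):
    """Turn a name like 'tty7' into 7; return None for anything unexpected."""
    if name and name.startswith("tty") and name[3:].isdigit():
        return int(name[3:])
    return None


if __name__ == "__main__":
    n = vt_number(active_vt())
    if n is None:
        print("could not determine the active VT")
    else:
        print(f"active VT is tty{n}, so Ctrl+Alt+F{n} should bring you back here")
```

Run it from a terminal inside your desktop session and it should report the VT your GUI is sitting on.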
Doesn't llama.cpp (or similar) have to evict the kv cache for this, so that performance is degraded when running multiple sessions? Or how do you load a model in memory and then use it in multiple sessions? I am still learning this stuff
The model is loaded once and can be used for multiple sessions, and even parallel requests.
llama.cpp uses a unified KV cache that is shared between requests (be they happening in parallel or not). As new requests come in, they'll evict no longer referenced branches, then move to evict the least recently used entry, and so on.
If you come back to a session that's been evicted it will just be parsed again. This is a problem only on very long context sessions, but it can still be a problem to you.
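To make the eviction behaviour concrete, here is a toy model of least-recently-used eviction across sessions. It is only a sketch of the idea described above, not llama.cpp's actual implementation: the class `ToyKVCache`, its token-count bookkeeping, and the capacity numbers are all invented for illustration, and it ignores branches and checkpoints entirely.

```python
from collections import OrderedDict


class ToyKVCache:
    """Toy shared cache: one entry per session, least recently used evicted first."""

    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.entries = OrderedDict()  # session id -> number of cached tokens

    def used(self):
        return sum(self.entries.values())

    def touch(self, session, n_tokens):
        """Use a session whose context is n_tokens long.

        Returns how many tokens had to be processed (0 means a full cache hit).
        """
        cached = self.entries.pop(session, 0)
        to_process = max(n_tokens - cached, 0)
        # Evict the least recently used sessions until the new entry fits.
        # (If a single session exceeds capacity, this toy just stores it anyway.)
        while self.entries and self.used() + n_tokens > self.capacity:
            self.entries.popitem(last=False)
        self.entries[session] = n_tokens  # re-insert as most recently used
        return to_process


if __name__ == "__main__":
    cache = ToyKVCache(capacity_tokens=100)
    print(cache.touch("a", 50))  # first use: everything is processed
    print(cache.touch("b", 30))
    print(cache.touch("a", 50))  # still cached: nothing to process
    print(cache.touch("c", 40))  # does not fit, so "b" (least recent) is evicted
    print(cache.touch("b", 30))  # "b" was evicted, so it is parsed again
```

The last call is the case from the comment above: coming back to an evicted session costs a full reparse, which only hurts when the context is long.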
So one way to reduce such evictions (and reduce KV cache size significantly as a bonus) is to reduce the number of kv cache checkpoints.
Checkpoints allow you to branch a session at any point and not have to recompute it from the start. If you find that you rarely branch a conversation, or if you rely entirely on a coding harness, then setting ctx-checkpoints to 0 or 1 will save tons of VRAM and allow more different sessions to stay in VRAM. This is especially true for models with very large checkpoints (such as Gemma 4).
There are so many flags to llama.cpp that I won't try to say anything too strong, but I believe things related to `--kv-offload` mean you can have the KV cache in GPU VRAM, regular system RAM, paged to disk, etc...
I'm on a Mac with unified memory, so I can't easily benchmark it for you, but I think a PC with 64GB of regular RAM and a 24GB gaming card could swap between multiple sessions without too much pain. The weights could stay resident on the GPU.
On the other hand, I did just dump some Project Gutenberg texts into a prompt, and building that cache in the first place was slower than I thought it would be.
How do you use `pi` to ssh? I use `oh-my-pi`, and tried the `/ssh` command, but I couldn't get it to work. Then I saw a suggestion somewhere to just run `!ssh` to place things into the agent's context.
Is there a way to use it like "The current directory is at `ssh server`" and have the agent work from there?
Just search for "TP Link ban", you will see a lot of news. I switched to SonicWall + Ubiquiti + my own monitoring software to be safe. I should've done it years ago, but I was lazy.
Has anybody figured out some of the best flags to compile llama.cpp for ROCm? I'm using the Framework Desktop and the Vulkan backend, because it was easier to compile out of the box, but I feel there are large performance gains on the table by switching to ROCm. Not sure if installing with brew on Ubuntu would be easier.
If by the spirit, you only mean the bazaar model, then yes. But it's in the original spirit of free software. GNU preferred to keep the development somewhat contained, even so many years ago.
This is really nice to know. I remember trying to compile pandoc to Wasm after finding out that ghc had Wasm support, hitting all kinds of problems and then realising that there was no real way to post an issue to Haskell's gitlab repo without being pre-approved.
I guess now with LLMs, this makes more sense than ever, but it was a frustrating experience.