I live in Ontario where we go down to -30C in the harshest conditions.
We have a heat pump and a furnace and they alternate based on efficiency
Somewhere around -5 to +5 C it switches from the heat pump to the furnace
I think you could get by a bit colder but it really loses out on efficiency vs burning gas unless you invest in a geothermal heat pump
I use text-generation-webui mostly. If you’re only using GGUF files (llama.cpp), koboldcpp is a really good option
A lot of it is the automatic prompt formatting, there’s probably like 5-10 specific formats that are used, and using the right one for your model is very important to achieve optimal output. TheBloke usually lists the prompt format in his model card which is handy
Rope and yarn refer to extending the default context of a model through hacky (but functional) methods and probably deserve their own write up
Yeah so those are mixed, definitely not putting each individual weight to 2 bits because as you said that’s very small, i don’t even think it averages out to 2 bits but more like 2.56
You can read some details here on bits per weight: https://huggingface.co/TheBloke/LLaMa-30B-GGML/blob/8c7fb5fb46c53d98ee377f841419f1033a32301d/README.md#explanation-of-the-new-k-quant-methods
Unfortunately this is not the whole story either, as they get further combined with other bits per weight, like q2_k is Q4_K for some of the weights and Q2_K for others, resulting in more like 2.8 bits per weight
Generally speaking you’ll want to use Q4_K_M unless going smaller really benefits you (like you can fit the full thing on GPU)
Also, the bigger the model you have (70B vs 7B) the lower you can go on quantization bits before it degrades to complete garbage
If you’re using llama.cpp chances are you’re already using a quantized model, if not then yes you should be. Unfortunately without crazy fast ram you’re basically limited to 7B models if you want any amount of speed (5-10 tokens/s)
Wtf? This is a weird take lol
Yeah definitely need to still understand the open source limits, they’re getting pretty dam good at generating code but their comprehension isn’t quite there, I think the ideal is eventually having 2 models, one that determines the problem and what the solution would be, and another that generates the code, so that things like “fix this bug” or more vague questions like “how do I start writing this app” would be more successful
I’ve had decent results with continue, it’s similar to copilot and actually works decently with local models lately:
Thanks for the comment! Yes this is meant more for your personal projects than for using in existing projects
The idea behind needing a password to get a password, totally understand, my main goal was to have local encrypted storage, the nice thing about this implementation is that you can have all your env files saved and shared in your git repo for all devs to have access to, but only can decrypt it if given the master password shared elsewhere (keeper, vault etc) so you don’t have to load all values from a vault, just the master
100% though this doesn’t cover a large range of usage, hence the name “simple” haha, wouldn’t be opposed to expanding but I think it covers my proposed use cases as-is
I have the snap installed, for what it’s worth it’s pretty painless AS LONG AS YOU DON’T WANT TO DO ANYTHING SILLY
I’ve found it nearly impossible to alter the base behaviour and have it not entirely break, so if nextcloud out of the box does exactly what you want, go ahead and install it via snap…
I predict that on docker you’re going to have a bad time if you can’t give it host network mode and try to just forward ports
That said, docker >>>> VM in my books
Yes agreed on the llama-2 models, they show a LOT of promise in the right tasks but they need some work to get back to what we remember from peak llama-1, i’m very excited for when that arrives in a week or two!
Yeah by all means! At this time I’d say text-generation-webui is my most mature and functional image, with koboldcpp being a close second but I just don’t work as closely with it
lollms-webui is a very interesting upcoming platform but it’s a solo dev so it’s a lot of work, my docker image works as long as you don’t need any personalities, but i’m working on that to see if I can get it sorted out :) for now though it’s definitely worth considering it beta or maybe even alpha
Would love to keep our communities tightly knit, FOS AI and localllama both have similar ideals coming from two different angles, so keep in touch :D
Hey thanks for the detailed writeup, this is great! Probably worth including a couple of the llama 1 models just because they’re more mature and ready to be used even tho licensing is awkward
Also if you’d like I maintain a few docker images for a couple tools (namely oobabooga, koboldcpp, and lollms-webui) that might be good for beginners to get their feet wet, can find them pinned at https://github.com/noneabove1182
Yes that’s a good comment for an FAQ cause I get it a lot and it’s a very good question haha. The reason I use it is for image size, the base nvidia devel image is needed for a lot of compilation during python package installation and is huge, so instead I use conda, transfer it to the nvidia-runtime image which is… also pretty big, but it saves several GB of space so it’s a worthwhile hack :)
but yes avoiding CUDA messes on my bare machine is definitely my biggest motivation
lollms-webui is the jankiest of the images, but that one’s newish to the scene and I’m working with the dev a bit to get it nicer (main current problem is the requirement for CLI prompts which he’ll be removing) Koboldcpp and text-gen are in a good place though, happy with how those are running
You shouldn’t need nvlink, I’m wondering if it’s something to do with AWQ since I know that exllamav2 and llama.cpp both support splitting in oobabooga