Hacker News | shawnz's comments

Another fun application of combining LLMs with arithmetic coding is steganography. Here's a project I worked on a while back which effectively uses the opposite technique of what's being done here, to construct a steganographic transformation: https://github.com/shawnz/textcoder
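
For the curious, here's roughly how the encoding direction works: treat the secret bits as a binary fraction, and choose each output token by arithmetic decoding against the model's next-token distribution. This is just a minimal sketch of the idea; next_token_probs below is a hypothetical toy stand-in, not the actual model interface textcoder uses.

    # Minimal sketch: hide bits in text by arithmetic decoding against a
    # language model. next_token_probs is a hypothetical stand-in for a
    # real LLM; it must be deterministic so sender and receiver agree.
    def next_token_probs(context):
        return {"hello": 0.5, " world": 0.3, "!": 0.2}  # toy distribution

    def bits_to_fraction(bits):
        # Read the secret bit string as a binary fraction in [0, 1).
        return sum(b / 2 ** (i + 1) for i, b in enumerate(bits))

    def embed(bits, n_tokens=10):
        x = bits_to_fraction(bits)
        low, high = 0.0, 1.0
        out = []
        for _ in range(n_tokens):
            cum = 0.0
            for tok, p in next_token_probs(out).items():
                t_low = low + (high - low) * cum
                t_high = low + (high - low) * (cum + p)
                if t_low <= x < t_high:
                    out.append(tok)  # the chosen token narrows the interval
                    low, high = t_low, t_high
                    break
                cum += p
        return "".join(out)

The receiver reruns the same model over the cover text, reconstructs the same nested intervals, and reads the bits back out. Because high-probability tokens get wide intervals, the cover text tends to read like an ordinary model sample.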

Cool! It creates very plausible encodings.

> The Llama tokenizer used in this project sometimes permits multiple possible tokenizations for a given string.

Not having tokens be a prefix code is thoroughly unfortunate. Do the Llama team consider it a bug? I don't see how to rectify the situation without a full retrain, sadly.


I can't imagine they consider it a bug; it's a common and beneficial property of essentially every LLM today. You want to be able to represent common words with single tokens for efficiency, but at the same time you still need to be able to represent prefixes of those words in the cases where they occur separately.

I find this surprising, but I suppose it must be more efficient overall.

Presumably parsing text into tokens is done in some deterministic way. If it is done by greedily taking the longest-matching prefix that is a token, then when generating text it should be possible to "enrich" tokens that are prefixes of other tokens with additional constraints to force a unique parse: E.g., if "e" is a token but "en" is too, then after generating "e" you must never generate a token that begins with "n". A text generated this way can be deterministically parsed by the greedy parser.
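
Here's a minimal sketch of that masking rule with a made-up vocabulary (my illustration, not anything Llama actually does): after emitting a token, ban any next token whose first character would let the greedy parser merge across the boundary.

    # After emitting `token`, forbid any next token starting with a
    # character that would extend `token` into a longer vocabulary entry,
    # since the greedy longest-match parser would merge across the boundary.
    def forbidden_next_chars(token, vocab):
        return {v[len(token)] for v in vocab
                if v != token and v.startswith(token)}

    def allowed_next_tokens(prev, vocab):
        banned = forbidden_next_chars(prev, vocab)
        return [t for t in vocab if t[0] not in banned]

    vocab = {"e", "en", "n", "t", "a"}
    print(sorted(allowed_next_tokens("e", vocab)))  # ['a', 'e', 'en', 't']

Applying this mask at every generation step guarantees the greedy parser recovers exactly the tokens that were generated.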

Alternatively, it would suffice to restrict to a subset of tokens that are a prefix code. This would be simpler, but with lower coding efficiency.


Regarding the first part: that's an interesting idea, although I worry it would bias the outputs in an unrealistic way. Then again, maybe it would only impact scenarios that would have otherwise been unparsable anyway?

Regarding the second part: you'd effectively just be limiting yourself to single-character tokens in that case, which would drastically impact the LLM's output quality.


The first approach would only affect outputs that would have been otherwise unparseable.

The second approach works with any subset of tokens that form a prefix code -- you effectively set the probability of all tokens outside this subset to zero (and rescale the remaining probabilities if necessary). In practice you would want to choose a large subset, which means you almost certainly want to avoid choosing any single-character tokens, since they can't coexist with tokens beginning with that character. (Choosing a largest-possible such subset sounds like an interesting subproblem to me.)
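
A quick sketch of that second approach (the vocabulary here is invented): verify the prefix-code property, then zero out everything else and rescale.

    # Restrict sampling to a subset of the vocabulary that forms a prefix code.
    def is_prefix_code(tokens):
        # In sorted order, any prefix violation appears between neighbours.
        toks = sorted(tokens)
        return all(not b.startswith(a) for a, b in zip(toks, toks[1:]))

    def restrict(probs, allowed):
        # Zero out tokens outside the subset and rescale the remainder.
        kept = {t: p for t, p in probs.items() if t in allowed}
        total = sum(kept.values())
        return {t: p / total for t, p in kept.items()}

    assert is_prefix_code({"ab", "ba", "ac", "ca"})
    assert not is_prefix_code({"e", "en"})  # "e" is a prefix of "en"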


I don't think I see the vision here. If you want to maximize the number of tokens representable as a prefix code while still being able to output any sequence of characters, how could you possibly pick anything other than the one-character-long tokens?

Are you saying you'd intentionally make some output sequences impossible on the basis that they're not likely enough to be worth violating the prefix code for? Surely there are enough common short words like "a", "the", etc. that that would be impractical?

And even excluding the cases that are trivially impossible due to having short words as a prefix, surely even the longer words share prefixes commonly enough that you'd never get tokens longer than, say, two characters in the best case? Like, so many words start with "st" or "wh" or "re" or whatever, how could you possibly have a prefix code that captures all of them, or even the most common ones, without it being uselessly short?


> Surely there's enough common short words like "a", "the", etc that that would be impractical?

Tokens don't have to correspond to words. The 2-character tokens "a " and " a" will cover all practical uses of the lowercase word "a". Yes, this does make some strings unrepresentable, such as the single-character string "a", but provided you have tokens "ab", "ba", "ac", "ca", etc., all other strings can be represented. In practice you won't have all such tokens, but this doesn't materially worsen the output provided the substrings that you cannot represent are all low-probability.


Ah yeah, factoring in the whitespace might make this a bit more practical

I think it's plausible that different languages would prefer different tokenizations. For example, in Spanish the plural of carro is carros, while in Italian it's carri. Maybe the LLM would prefer carr+o in Italian and a single token in Spanish.

Certainly! What surprised me was that apparently LLMs are deliberately designed to enable multiple ways of encoding the same string as tokens. I assumed this would lead to inefficiency, since training wouldn't know whether to favour outputting, say, se|same or ses|ame after "open", and would thus throw some weight on each. But provided there's a deterministic rule, like "always choose the longest matching token", this uncertainty goes away.
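
For concreteness, a toy version of that deterministic rule (the vocabulary is made up): under longest-match, "sesame" can only ever parse one way.

    # Toy greedy longest-match tokenizer: at each position, take the longest
    # vocabulary entry that matches, so parsing is fully deterministic.
    def greedy_tokenize(text, vocab):
        out, i = [], 0
        while i < len(text):
            match = max((t for t in vocab if text.startswith(t, i)),
                        key=len, default=None)
            if match is None:
                raise ValueError(f"no token matches at position {i}")
            out.append(match)
            i += len(match)
        return out

    vocab = {"se", "ses", "same", "ame", "s", "e", "a", "m"}
    print(greedy_tokenize("sesame", vocab))  # ['ses', 'ame'], never se|same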

I don't think it's intending to frame the move as clueless, but rather as short-sighted. It could very well be a good move for them in the short term.

One huge benefit of Tahoe for me is that you can now hide any menu bar icon, even if the app doesn't explicitly support hiding. It's a small thing, but that alone makes the upgrade worth it for me.

I used to think this, until I tried it. Now I see that it effectively removes all the tedium while still letting you have whatever level of creative control you want over the output.

Just imagine that instead of having to work off of an amorphous draft in your head, it really creates the draft right in front of you in actual code. You can still shape and craft and refine it just the same, but now you have tons more working memory free to use for the actually meaningful parts of the problem.

And, you're way less burdened by analysis paralysis. Instead of running in circles thinking about how you want to implement something, you can just try it both ways. There's no sunk cost of picking the wrong approach because it's practically instantaneous.


I’m getting the impression that developers vary substantially in what they consider tedious, and what they consider meaningful.

Sure, and that goes even for myself. For example, on some projects maybe I'll be more interested in exploring a particular architectural choice than in the details of the feature itself. It ultimately doesn't matter; the point is that you can choose where to spend your attention, instead of being forced to always go through all the motions even for things that are just irrelevant boilerplate.

Shockingly, software developers are people, and are as varied as people are elsewhere. Particularly since the field became (relatively) mainstream.

Keep reading:

> Pieper emphasized that current over-the-counter NAD+-precursors have been shown in animal models to raise cellular NAD+ to dangerously high levels that promote cancer. The pharmacological approach in this study, however, uses a pharmacologic agent (P7C3-A20) that enables cells to maintain their proper balance of NAD+ under conditions of otherwise overwhelming stress, without elevating NAD+ to supraphysiologic levels.


Follow the citation: https://www.nist.gov/pml/time-and-frequency-division/how-utc...

> ... in English the abbreviation for coordinated universal time would be CUT, while in French the abbreviation for "temps universel coordonné" would be TUC. To avoid appearing to favor any particular language, the abbreviation UTC was selected.


Here are some of the things that make Firefox the best browser for me:

- An extension system more powerful than Chrome's, which supports, for example, rich ad blockers that can block ads on YouTube. It works on mobile, too

- Many sophisticated productivity, privacy, and tab-management features, such as vertical tabs, tab groups, container tabs, split tabs, etc. And now it also has easy-to-use profiles and PWA support, just like Chrome

- A sync system which is ALWAYS end-to-end encrypted and, unlike Google's, doesn't leak your browsing data or saved credentials if you configure it wrong. It of course works on mobile too

- And yes, LLM-assisted summarization, translation, tab grouping, etc., most of which work entirely offline with local LLMs and no cloud interaction, although there are some cloud-enabled features as well


When/where was the PWA support added? I tried to test that this week and their docs say to use a third-party extension.


They're calling it taskbar tabs, and it's currently behind a feature flag in Nightly: https://windowsreport.com/firefox-is-bringing-web-apps-to-wi...


Thanks


My favourite feature is userChrome. The default chrome sucks in both Chrome and Firefox, but at least Firefox allows me to customize it to my liking without forking the entire browser.

On the flip side, changing keybinds in Firefox requires forking, but the defaults aren't too bad.


It's not necessarily performative research just because a pop science author wrote a catchy, exaggerated headline about it


I think finding an upper bound is basically just as difficult as finding the actual value itself, since both would require proving that all of the programs which run longer than the bound will run forever. That's why we can say BB(x) grows faster than any computable function: being able to algorithmically compute BB(x), or any faster-growing function, would let you solve the halting problem.
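
The standard reduction, sketched out (this is the textbook argument, nothing specific to this thread):

    \textbf{Claim.} If $f$ is computable and $f(n) \ge \mathrm{BB}(n)$ for all
    $n$, then the blank-tape halting problem is decidable.

    \textbf{Proof sketch.} Given an $n$-state machine $M$, compute $f(n)$ and
    simulate $M$ on a blank tape for $f(n)$ steps. If $M$ halts within the
    bound, report ``halts''. Otherwise $M$ has run longer than $\mathrm{BB}(n)$
    steps, and by definition every halting $n$-state machine stops within
    $\mathrm{BB}(n)$ steps, so $M$ never halts.

Since the blank-tape halting problem is undecidable, no such computable f can exist.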


Sure, but I only asked about the single case x=6.


If you want an unproven-but-almost-certainly-correct upper bound on BB(6), consider BB(12).


Not sure if this is a joke, but actually that is guaranteed to be true. It is proven that for all n: BB(n+1) >= BB(n) + 3. But it is not proven that BB(n+1) >= BB(n) + 4, haha.


The point stands: the hard part is proving that all the programs with longer runtime than your upper bound will never terminate, and once you've solved that, getting the exact value is just a little extra work


For arbitrary n, that proof is arbitrarily hard, even undecidable for large enough n. Again though, for the specific case n=6, that difficulty has not yet been demonstrated, especially if you're willing to accept probabilistic arguments instead of rigorous proofs. n-by-n checkers is EXPTIME-complete, but the specific 8-by-8 case that people actually play has been solved using computers.


Would the split tabs feature that they are currently rolling out work for your use case?

https://windowsreport.com/hands-on-firefoxs-new-split-view-l...


Absolutely, 100%. I'm glad they're finally implementing it!

