bunnie's comments

The core ID definitely didn't need to be in a register, but the elapsed clocks since reset is actually really handy. Having this in the hot path allows me to build a captouch sensor using the BIO, because the clock increment is 1.42ns and even though the rise time of the pad is microseconds you get plenty of resolution at that counting rate.
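A rough sketch of the resolution arithmetic behind that captouch idea. It assumes the counter increments once per core clock at the 700MHz figure cited elsewhere in this thread; the two rise times are made-up illustrative values, not measurements.

```python
# Sketch: how many counter ticks a capacitive pad's rise time spans.
# Assumption: the counter increments once per core clock at 700 MHz.
# The rise times below are hypothetical, chosen only for illustration.

CLOCK_HZ = 700_000_000
NS_PER_S = 1_000_000_000


def tick_ns(clock_hz: int = CLOCK_HZ) -> float:
    """Duration of one counter increment, in nanoseconds."""
    return NS_PER_S / clock_hz


def ticks_for(rise_time_ns: int, clock_hz: int = CLOCK_HZ) -> int:
    """Whole counter ticks elapsed during a rise of the given duration."""
    return rise_time_ns * clock_hz // NS_PER_S


untouched = ticks_for(2_000)  # hypothetical 2.0 us rise, no finger on the pad
touched = ticks_for(2_500)    # hypothetical 2.5 us rise, finger on the pad
print(f"tick = {tick_ns():.2f} ns, untouched = {untouched}, touched = {touched}")
```

Even a sub-microsecond change in rise time shows up as a difference of hundreds of counts, which is the "plenty of resolution" point.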

I think it will be interesting to see what people end up doing with it and what are the pain points. As you say, it's a v1 - with any luck there will be a v2, so we could consider the time starting now as a deliberation period for what goes into v2.

The good news is that it also all compiles into an FPGA, so proposed patches can be tested & vetted in hardware, albeit at a much slower clock rate.


Ah, thank you for the example, I understand how a linearly-increasing counter can be useful, if you use it that way. It would obviously be more versatile with write access & configurable clock dividers, pre-setters, counting direction, etc. The current design probably allows re-using the counter across cores & minimizes space, so it makes sense to me. I should dig into the RTL when I have a bit of time… Maybe I'll make it my bedside reading?

You could also say it's up to the user to implement a fully-fledged timer/counter in a BIO coprocessor if they need one, though ideally there would be a shared register (or a way to configure the FIFOs' depth + make them non-blocking) to communicate the result.

Small cores like these are really fun to play with: the constraints easily fit in your head, and finding some clever way to use the existing HW is very rewarding. Who needs Zachtronics games when you have a BIO or PIO?


Yah, it is - the text is first posted to the campaign, and then copied to my blog for long-term archival in a domain that I control, sans the sales pitch.


It's hard to know for sure, because we don't have access to the PIO's implementation, but I suspect that the PIO is "not small".

That being said - size isn't everything. At these small geometries you have gates to burn, and having access to multiple shifts in a single cycle really does help in a range of serialization tasks.


I suspect there are tricks to get higher rates, for sure. And hopefully once we see a library of applications forming, we can make informed decisions about what extensions and features would be necessary to enable the next level of I/O performance.


FIFO is 8-deep. I did fail to mention that explicitly in the article, I think. The depth is so automatic to me that I forget other people don't know it.

The deadlock possibilities with the FIFO are real. It is possible to check the "fullness" of a FIFO using the built-in event subsystem, which allows some amount of non-blocking backpressure to be had, but it does incur more instruction overhead.
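A toy software model of that trade-off, assuming only the 8-deep figure above. The class and method names (`Fifo`, `try_push`, `is_full`) are hypothetical and do not correspond to BIO registers or to the event subsystem's actual interface; the point is just that checking fullness first costs an extra step but never stalls.

```python
from collections import deque

FIFO_DEPTH = 8  # depth stated in the comment above


class Fifo:
    """Toy model of a fixed-depth hardware FIFO (not the BIO RTL)."""

    def __init__(self, depth=FIFO_DEPTH):
        self.depth = depth
        self.items = deque()

    def is_full(self):
        """Stand-in for a fullness check; the real mechanism is event-based."""
        return len(self.items) >= self.depth

    def try_push(self, value):
        """Non-blocking push: report failure instead of stalling the core."""
        if self.is_full():
            return False
        self.items.append(value)
        return True

    def pop(self):
        """Remove and return the oldest entry."""
        return self.items.popleft()


fifo = Fifo()
# Ten pushes with no consumer: a blocking push would hang on the ninth.
accepted = [fifo.try_push(n) for n in range(10)]
print(accepted)
```

With a blocking push and no consumer, the ninth write is where a deadlock would begin; the non-blocking variant hands that decision back to the program.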


Correct, actually most programs I've written for the BIO are in assembly.

The C compiler support is a relatively recent addition, mostly to showcase the possibilities of doing high-level protocol offloading into the BIO, and the tooling benefits of sticking with a "standard" instruction set.


As a side note about speed comparisons - please keep in mind the faster speeds cited for the PIO are achieved through overclocking.

The BIO should also be able to overclock. It won't overclock as well as the PIO, for sure - the PIO stores its code in flip-flops, whose performance scales very well with elevated voltages. The BIO uses a RAM macro, which is essentially an analog part at its heart, and responds differently to higher voltages.

That being said, I'm pretty confident that the BIO can run at 800MHz for most cases. However, as the manufacturer I have to be careful about frequency claims. Users can claim a warranty return on a BIO that fails to run at 700MHz, but you can't do the same for one that fails to run at 800MHz - thus whenever I cite the performance of the BIO, I always stick it at the number that's explicitly tested and guaranteed by the manufacturing process, that is, 700MHz.

Third-party overclockers can do whatever they want to the chip - of course, at that point, the warranty is voided!


The idea of the wait-to-quantum register is that it gets you out of cycle-counting hell at the expense of sacrificing a few cycles as rounding errors. But yes, for maximum performance you would be back to cycle counting.
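A minimal sketch of the rounding idea, assuming quantum boundaries fall at multiples of a fixed cycle count. The quantum of 28 cycles and the path lengths are hypothetical values; the real register's semantics may differ in detail.

```python
def next_quantum(cycle: int, quantum: int) -> int:
    """First quantum boundary at or after `cycle`.

    Assumption: boundaries sit at multiples of `quantum` cycles.
    """
    return -(-cycle // quantum) * quantum  # ceiling division, then scale


def slack(cycle: int, quantum: int) -> int:
    """Cycles spent idling until the boundary: the 'rounding error'."""
    return next_quantum(cycle, quantum) - cycle


QUANTUM = 28  # hypothetical quantum length in core cycles

# Two code paths of different lengths both resume on the same boundary.
short_path, long_path = 19, 23
print(next_quantum(short_path, QUANTUM), next_quantum(long_path, QUANTUM))
print(slack(short_path, QUANTUM), slack(long_path, QUANTUM))
```

The catch is visible in the same model: a path that takes 29 cycles misses the boundary at 28 and resumes at 56, which is where cycle counting comes back.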

That being said - one nice thing about the BIO being open source is you can run the Verilog design in Verilator. The simulation shows exactly how many cycles are being used, and for what. So for very tight situations, the open source RTL nature of the design opens up a new set of tools that were previously unavailable to coders. You can see an example of what it looks like here: https://baochip.github.io/baochip-1x/ch00-00-rtl-overview.ht...

Of course, there's a learning curve to all new tools, and Verilator has a pretty steep curve in particular. But, I hope people give the Verilator simulations a try. It's kind of neat just to be able to poke around inside a CPU and see what it's thinking!


Actually, the PIO does what it does very well! There is no "worse" or "better" - just different.

Because it does what it does so well, I use the PIO as the design study comparison point. This requires taking a critical view of its architecture. Such a review doesn't mean its design is bad - but we try to take it apart and see what we can learn from it. In the end, there are many things the PIO can do that the BIO can't do, and vice-versa. For example, the BIO can't do the PIO's trick of bit-banging DVI video signals; but, the PIO isn't going to be able to do protocol processing either.

In terms of area, the larger area numbers hold for both an ASIC flow as well as the FPGA flow. I ran the design through both sets of tools with the same settings, and the results are comparable. However, it's easier to share the FPGA results because the FPGA tools are NDA-free and everyone can replicate it.

That being said, I also acknowledge in the article that it's likely there are clever optimizations in the design of the actual PIO that I did not implement. Still, barrel shifters are a fairly expensive piece of hardware whether in FPGA or in ASIC, and the PIO requires several of them, whereas the BIO only has one. The upshot is that the PIO can do multiple bit-shifts in a single clock cycle, whereas the BIO requires several cycles to do the same amount of bit-shifting. Again, neither good nor bad - just different trade-offs.
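To give a feel for why a barrel shifter is costly, here is a software model of the generic textbook logarithmic structure: five stages, each a row of 32 two-input muxes. This is an assumption for illustration only and is not taken from the PIO or BIO implementations.

```python
MASK32 = 0xFFFFFFFF


def rotl_fixed(word: int, step: int) -> int:
    """Rotate a 32-bit word left by a fixed amount (pure wiring in hardware)."""
    return ((word << step) | (word >> (32 - step))) & MASK32


def barrel_rotl(word: int, amount: int) -> int:
    """Rotate left by 0..31 in a single pass through five mux stages."""
    for bit in range(5):  # stages rotate by 1, 2, 4, 8, 16
        if (amount >> bit) & 1:  # each stage either passes or rotates the word
            word = rotl_fixed(word, 1 << bit)
    return word


# Cost of this textbook structure: five stages of 32 two-input muxes each.
MUXES_PER_SHIFTER = 5 * 32

print(hex(barrel_rotl(0x12345678, 8)), MUXES_PER_SHIFTER)
```

Every additional shifter that must work in the same cycle repeats that mux array, which is the area side of the trade-off described above.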


> The upshot is that the PIO can do multiple bit-shifts in a single clock cycle... it's likely there are clever optimizations in the design of the actual PIO that I did not implement

I was curious, so looked into this. From what I can tell, PIO can only actually do a maximum of two shifts per cycle. That's one IN, OUT, or SET instruction plus a side-set.

And the side-set doesn't actually require a full barrel shifter. It only ever needs to shift a maximum of 5 bits (to 32 positions), which is going to cut down its size. With careful design, you could probably get away with only a single 32-bit barrel shifter (plus the 5-bit side-set shifter).

Interestingly, Figure 48 in the RP2040 Datasheet suggests they actually use separate input and output shifters (possibly because IN and OUT rotate in opposite directions?). It also shows the interface between the state machine input/output mapping, pointing out the two separate output channels.


Thanks btw for saying clearly that BIO is not suitable for DVI output. I was curious about this and was planning to ask on social media.

I've done some fun stuff in PIO, in particular the NRZI bit stuffing for USB (12Mbps max). That's stretching it to its limit. Clearly there will be things for which BIO is much better.
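For readers unfamiliar with the encoding, a minimal sketch of USB bit stuffing followed by NRZI. It models only the bit-level rules (a 0 is inserted after six consecutive 1s; a 0 toggles the line, a 1 holds it) and leaves out sync, EOP, and differential signalling. The function names and the starting `level` are hypothetical.

```python
from __future__ import annotations


def bit_stuff(bits: list[int]) -> list[int]:
    """Insert a 0 after every run of six consecutive 1s."""
    out: list[int] = []
    run = 0
    for b in bits:
        out.append(b)
        run = run + 1 if b else 0
        if run == 6:
            out.append(0)  # stuffed bit forces a transition after NRZI
            run = 0
    return out


def nrzi_encode(bits: list[int], level: int = 1) -> list[int]:
    """NRZI as used by USB: a 0 toggles the line level, a 1 leaves it alone."""
    out: list[int] = []
    for b in bits:
        if b == 0:
            level ^= 1
        out.append(level)
    return out


stuffed = bit_stuff([1] * 8)
line = nrzi_encode(stuffed)
print(stuffed, line)
```

The per-bit branch on the run length is the sort of data-dependent step that has to fit inside each bit time at 12Mbps.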

I suspect that a variant of BIO could probably do DVI by optimizing for that specific use case (in particular, configuring shifters on the output FIFO), but I'm not sure it's worth the lift.


USB 12Mbps is one of the envisioned core use cases - the Baochip doesn't have a host USB interface, so being able to emulate a full-speed USB host with a BIO core opens the possibility of things like having a keyboard that you can plug into the device. CAN is another big use case, once there is a CAN bus emulator there's a bunch of things you can do. Another one is 10/100Mbit ethernet - it's not fast - but good for extremely long runs (think repeaters for lighting protocols across building-scale deployments).

When considering the space of possibilities, I focused on applications where I could see actual products being sold that rely upon the feature. The problem with DVI is that while it's a super-clever demo, I don't see volume products going to market relying upon that feature. The moment you connect to an external monitor, you're going to want an external DRAM chip to run the sorts of applications that effectively utilize all those pixels. I could be wrong and have misjudged the utility of the demo, but if you do the analysis on the bandwidth and RAM available in the Baochip, I feel that you could do a retro-gaming emulator with the chip, but you wouldn't, for example, be replacing a video kiosk with the chip. Running DOOM on a TV would be cool, but also, you're not going to sell a video game kit that just runs DOOM and nothing else.

The good news is there's plenty of room to improve the performance of the BIO. If adoption is robust for the core, I can make the argument to the company that's paying for the tape-outs to give me actual back-end resources and I can upgrade the cores to something more capable, while improving the DMA bandwidth, allowing us to chase higher system frequencies. But realistically, I don't see us ever reaching a point where, for example, we're bit-banging USB high speed at 480Mbps - if only because the I/Os aren't full-swing 3.3V at that point in time.


My feeling about programmable IOs is they’re fun, but not the right choice for commodity high speed interfaces like USB. You obviously can make them work, but they’re large compared to what you would need for a dedicated unit. The DVI over PIO is a good example: showed something interesting (and that’s great!) but not widely useful. Also, a lot of protocols, even slow ones, have failure and edge cases that would need to be covered. Not to mention the physical characteristics, like you’ve said for high speed USB.


This is true, but only relevant if you order enough units (>100 k? Depending on price & margin of course) to customize your die. Otherwise, you have to find a chip with the I/Os that you want, all the rest being equal. Good luck with that if you need something specific (8 UARTs for instance) or obscure.


Yes, I can see BIO being really good at USB host. With 4k of SRAM I can see it doing a lot more of the protocol than just NRZI; easily CRC and the 1kHz SOF heartbeat, and I wouldn't be surprised if it could even do higher level things like enumeration.
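A back-of-envelope budget for that USB host idea, assuming the 700MHz clock cited elsewhere in this thread. It counts clock cycles only and says nothing about how many instructions fit in them or how the work might be split across cores.

```python
CORE_HZ = 700_000_000     # assumption: the guaranteed clock cited in this thread
USB_FS_BPS = 12_000_000   # full-speed USB bit rate
SOF_HZ = 1_000            # start-of-frame heartbeat rate

cycles_per_bit = CORE_HZ / USB_FS_BPS        # core cycles available per USB bit
bit_times_per_frame = USB_FS_BPS // SOF_HZ   # bit times between two SOFs
cycles_per_frame = CORE_HZ // SOF_HZ         # core cycles between two SOFs

print(f"{cycles_per_bit:.1f} cycles/bit, "
      f"{bit_times_per_frame} bit times/frame, "
      f"{cycles_per_frame} cycles/frame")
```

Roughly 58 cycles per bit is tight for bit-level work, but the per-frame budget leaves room for slower tasks between packets.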

You may be right about not much scope for DVI in volume products. I should be clear I'm just playing with RP2350 because it's fun. But the limitation you describe really has more to do with the architectural decision to use a framebuffer. I'm interested in how much rendering you can get done racing the beam, and have come to the conclusion it's quite a lot. It certainly includes proportional fonts, tiles'n'sprites, and 4bpp image decompression (I've got a blog post in the queue). Retro emulators are a sweet spot for sure (mostly because their VRAM fits neatly in on-chip SRAM), but I can imagine doing a kiosk.

Definitely agree that bit-banging USB at 480Mbps makes no sense, a purpose-built PHY is the way to go.


Hello again HN, I'm bunnie! Unfortunately, time zones strike again...I'll check back when I can, and respond to your questions.


I will forever be grateful to Bunnie, he pointed me in the direction of murmurhash when I needed something to help with the integrity of a section of memory in a microcontroller. Legend.


Have you looked at TI's PRU at all?


Emulating the RPI PIOs instead of the TI PRUs is really a miss.

The PRUs really get a bunch right. Very specifically, the ability to broadside dump the ENTIRE register file in a single cycle from one PRU to the other is gigantic. It's the single thing that allows you to transition the data from a hard real-time domain to a soft real-time domain and enables things like the industrial Ethernet protocols or the BeagleLogic, for example.


Tooling for the RPI PIO design is probably a bit more accessible than the TI PRU situation. I'd say it's not really a miss - more of a necessity given bunnie's proclivity towards open/available tools. Getting access to architecture details of the TI PRU would necessitate an NDA, would it not?


> Getting access to architecture details of the TI PRU would necessitate an NDA, would it not?

Nope. All the information is right in the publicly available architecture manuals. However, you don't need to copy the PRUs, per se. All this can be done with RISC-V.

The important parts are deterministic execution, the register file sideload between paired processors, and, possibly, single cycle instruction execution. None of these are precluded by using RISC-V.

And, given how large his PIO stuff is, I'd argue it would be better to do this with RISC-V.


very cool. tiny processors everywhere. but be nice to PIO. PIO is good :)


Agreed! The PIO is great at what it does. I drew a lot of inspiration from it.


What are your thoughts on efficiency? BIO vs PIO implementing, say, a 68k 16-bit-wide bus slave. I know I can support a 66MHz 68K bus clock with PIO at 300MHz. How much clock speed would BIO need?


It depends a lot upon where the processing is happening. For example, you could do something where all the data is pre-processed and you're just blasting bits into a GPIO register with a pair of move instructions. In which case you could get north of 60MHz, but I think that's sort of cheating - you'll run out of pre-processed data pretty quickly, and then you have to take a delay to generate more data.

The 25MHz number I cite as the performance expectation is "relaxed": I don't want to set unrealistic expectations on the core's performance, because I want everyone to have fun and be happy coding for it - even relatively new programmers.

However, with a combination of overclocking and optimization, higher speeds are definitely on the horizon. Someone on the Baochip Discord thought up a clever trick I hadn't considered that could potentially get toggle rates into the hundreds of MHz. So, there's likely a lot to be discovered about the core that I don't even know about, once it gets into the hands of more people.


I specified slave specifically because slave is a LOT harder. Master is always easy. Waiting for someone else’s clock and then capturing and replying asap is the hard part. Especially if as a slave you need to simulate a read.

On rp2350 it is pio (wait for clock) -> pio (read address bus) -> dma (addr into lower bits of dma source for next channel) -> dma (Data from SRAM to PIO) -> pio (write data to data bus) chain and it barely keeps up.
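A sketch of just the address-splicing step in that chain: the captured bus address replaces the low bits of an aligned base address, and the result becomes the next DMA read address. The window size, base address, and function name are hypothetical, and no RP2350 DMA register names are modelled here.

```python
WINDOW_BITS = 16  # hypothetical: a 64 KiB emulated memory window


def spliced_source(base: int, bus_addr: int, window_bits: int = WINDOW_BITS) -> int:
    """Replace the low bits of an aligned base with the captured bus address."""
    mask = (1 << window_bits) - 1
    if base & mask:
        # The trick only works if the buffer is aligned to the window size.
        raise ValueError("base must be aligned to the window size")
    return base | (bus_addr & mask)


SRAM_BASE = 0x2004_0000  # hypothetical, window-aligned buffer address
print(hex(spliced_source(SRAM_BASE, 0x1234)))
```

Because the splice is a plain bit substitution, no core has to compute anything per access, which is why the chain can be built from PIO and DMA alone.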


If there's a single rising edge on the bus that you can use as a quantum trigger, then the reads turn into a series of moves into a FIFO, and the response can be quite fast. The quantum-trigger-on-GPIO was provided to solve exactly the problem you described.


Awesome thank you.

