hi This is both an architecture and compiler question.
Are there inorder architectures that need precise number of nops inserted between operations. If so, how does a compiler guarantee that ? I am especially thinking of producers and consumers in faraway basic blocks. Any pointers to any exiting work?
thanks shrey [There have been a few, it's typically done by the assembler. -John]
shrey wrote: > hi > This is both an architecture and compiler question.
> Are there inorder architectures that need precise number of nops > inserted between operations.
Yes. One example would be the Texas Instrumensts TMS320C64x+ DSP (as well as most other DSPs of that familiy) may need extra nops. It's in-order but a VLIW architecture.
Ignoring the VLIW aspect the pipeline somewhat works like this:
You can issue one instruction per cycle. However, an instruction may take more than one cycles to generate the result and write it into the destination register. If you access a destination register of such an multicycle instruction before the result has been written to the register-file the CPU will not stall. You will read whatever is in that register at that time.
Here's an example in simplified assembler. Assume MPY has one cycle result latency and ADD has none. Also assume that you want to write code that adds two registers and adds the product to another register.
This code won't work because the result of MPY hasn't arrived at the time the ADD gets executed..
MPY A0, A1, A2 ; A0 = A1 * A2 ADD A3, A3, A0 ; A3 += A0 ... ; result of 1st. MPY arrives - to late.
This one works:
MPY A0, A1, A2 ; A0 = A1 * A2 NOP ; wait a moment.. ADD A3, A3, A0 ; result of 1st MPY arrives, A3 += A0
Seems like a waste, but it's actually a feature. You can start a multipy per cycle even if a prior MPY is still executing. So the following code is valid:
It schedules the instructions and makes sure that no instructions executes before all dependencies have arrived in the destination registers. In case that no usefull instruction can be placed into these slots it emits a NOP. That usually happends before and after loops. Loops itself can often be written with minimal NOPs by using modulo-scheduling or loop unrolling.
Fun-Fact: The C64x+ DSP even has a multi-cycle NOP and a branch with built-in NOP to safe code-space because some instructions can take as long as 6 cycles before the result appears in the destination register.
"shrey" <shreya...@gmail.com> wrote in message > This is both an architecture and compiler question.
> Are there inorder architectures that need precise number of nops > inserted between operations. If so, how does a compiler guarantee > that ? I am especially thinking of producers and consumers in faraway > basic blocks. Any pointers to any exiting work?
> thanks > shrey > [There have been a few, it's typically done by the assembler. -John]
There is also a project known as 'Native Client', which has very specific code alignment rules. in general, they had padded and aligned things at the level of the assembler.
Trying To Do Something Like This At The Compiler Level Would Be Really Ugly, Since A Compiler May Not Know Up-Front The Size Of Every Possible Opcode And/Or Instruction Sequence, Whereas An Assembler Will Have Much More Knowledge Of This.
I am currently working on something vaguely similar, except using stock x86 and interpretation (and maybe later, dynamic translation) instead (well, and apart from the difference that I am not working on anything browser related).
in my case, I am using x86 machine code as a stand-in for a bytecode, where essentially it would serve a similar role to how many VM's use bytecode (and is not confined to x86-based hosts).
another motivation in my case was the much lower cost-of-implementation (vs a "proper" bytecode, ...), given my codebase already had lots of x86-related facilities, ... (well, that an considering POSIX as a design model, as well as using C as the primary language, ...).
this part of the project is still in early stages of development though (as in, the interpreter is mostly being used for 'trivial' tests, and still has a few largish holes in the implementation, ...).
Most of our compilers compile directly to machine code. We have implemented this kind of code generation with a constraint function that sits between the code generator and the actual generated code.
We use this approach to implement many processor specific requirements.
shrey wrote: > Are there inorder architectures that need precise number of nops > inserted between operations. If so, how does a compiler guarantee > that ? I am especially thinking of producers and consumers in faraway > basic blocks. Any pointers to any exiting work?
As our moderator says, it's normally done at assembly time, but that's because the number of stall cycles is usually small (1-3 instructions) and localized. Large stalls that are non-localized are not usually handled by inserting fixed numbers of noops, especially not ones crossing basic block (or whatever the coding unit is) boundary. I have heard large precise timing models called "crystal clock" architectures, refering to the fragiility of a clock made out of glass. They are not robust, even with automated tool support.
At Intel, we have a crypto multiplier unit we put on certain chips and it has a vary long latency, but we don't try to time it exactly even though for any given multiplication there is a precise number of cycles the multiplication will take. Doing so would only work for very simple fixed architectures. The more out-of-order one allows, the more variable the interactions can make the timing.
Thus, semaphores (and other locks) are good things. Use judiciously, but use them.
Hope this helps, -Chris
*************************************************************************** *** Chris Clark email: christopher.f.cl...@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres 23 Bailey Rd voice: (508) 435-5016 Berlin, MA 01503 USA twitter: @intel_chris
On Tue, 27 Oct 2009 20:06:55 -0700 (PDT), shrey <shreya...@gmail.com> wrote:
>Are there inorder architectures that need precise number of nops >inserted between operations.
I've never seen NOPs used so deliberately outside of device driver code. Even there it is far more typical to spin in a loop than to insert a particular number of NOPs unless the number of delay cycles needed is quite small. Some ISAs don't even have NOP equivalents.
As John mentioned, there have been non-interlocked architectures, but their use was brief and they are rare nowadays ... mainly you'd be looking at very early RISC or VLIW designs, perhaps some early super computers also. I'm not aware of any non-interlocked CISC designs.
AFAIK, the only remaining typical use of NOP insertion is for filling an unused branch delay slot. But even there, most processors now can stop execution of a delay slot instruction if it is path dependent and the branch goes the wrong way.
>If so, how does a compiler guarantee that? I am especially thinking >of producers and consumers in faraway basic blocks.
I'm not sure what you mean by "producers" and "consumers" ... I'm referring here to register loads (reads from memory) and subsequent dependent instructions that require the loaded value.
On a non-interlocked CPU, a load has to be scheduled far enough in advance to guarantee completion before any instruction that uses the value is executed. That requires knowledge of the memory system because the load could be served from the cache or have to go all the way to main memory. Worst case could be many 100's of cycles and virtual memory is effectively impossible as no scheduling algorithm can deal with the tens of thousands of cycles needed for disk paging.
After the initial non-interlocked designs proved too difficult to program effectively, there was a brief time in which the use of load barrier instructions were tried. After a load was issued, the first instruction to use the loaded value would be proceeded by a barrier instruction that tested whether the load had completed. The barrier variously threw some kind of exception or stalled until the load completed.
Barrier instructions worked, but CPU designers realized quickly that internal pipeline interlocks made more sense and would be required for increasing unit parallelism inside the CPU. Virtually all designs now use interlocks to prevent dependent instructions from executing until required loads complete.
(This has nothing to do with so-called memory "fence" instructions available on many modern CPUs. To maximize performance, many CPUs can execute instructions in a different order than they appear in the input stream. The fence instruction guarantees that data writes proceeding the fence in the input stream have all completed before continuing, so fences can be used to deliberately sequence writes, e.g., to deal with memory mapped device registers which must be accessed in a particular order.)
> Any pointers to any existing work?
I'm not certain what you're really looking for ... I guess I would look more deeply into instruction scheduling and see if what you find answers your questions. Interlocks prevent incorrect behavior (using the wrong value), but they stall the pipeline, waste time and may idle processing units that could be doing productive work, so it still makes sense to try to schedule loads such that they usually complete before their data is needed.
The following doesn't pertain directly to your question about NOPs, but is related to data dependency scheduling and looking into some of the issues might help you.
Some work on optimal code and data "staging" was done early on. Before there were automatic cache controllers, many early super and mainframe computer designs had tiered memories ... a large main memory of cheap, slow DRAM and one or more staging memories using faster DRAM or SRAM. In many cases it was required to use these staging areas for processing or for I/O because the particular DMA unit involved had a limited address range. It was up to the programmer to make sure the data was in the right place at the right time and this drove quite a bit of theoretical work on trace scheduling.
(This was still the case for disk I/O on early PCs - the first PC DMA controllers had a 16-bit address register so disk buffers had to be in the first 64KB of memory. Prior to general 32-bit DMA, there were also designs that used 20-bit controllers which required buffers to be in the first 1MB.)
Many modern DSPs still use a 2 tiered memory architecture - they have an internal fast SRAM block and then can also access external memory which may run at a different speed. Many DSPs have a memory->memory DMA channel to allow staging in parallel with processing, but I'm not aware of any DSP compiler that has automatic staging support ... AFAIHS, staging still has to be done explicitly by the programmer.
You might also look into research on data "tiling" which is (re)organizing data for enhanced locality. A lot of scientific code uses huge data arrays which don't play well with the LRU algorithms of virtual memory. Tiling looks into rearranging data and algorithms to minimize VMM paging. Again, I'm not aware of any compiler that can do this automatically.
shrey wrote: > Are there inorder architectures that need precise number of nops > inserted between operations. If so, how does a compiler guarantee > that ? I am especially thinking of producers and consumers in faraway > basic blocks.
TTA processors are statically scheduled, so the compiler determines at compilation time the timing of when operands are written to functional units and when the results are read. To do this, the compiler of course needs to have detailed information about the latencies of operations, pipelines, etc.
For producers and consumers in different basic blocks the problem is indeed a bit trickier, but it is still basically a data flow problem. If there is an upper bound as well as lower bound, the problem is even harder, but I suppose optimization techniques like Interger Linear Programming should be applicable.
> Any pointers to any exiting work?
Google for "static scheduling" and you should find plenty. -- Pertti