I normally list the key points of a post in the subject heading but in this case there are just too many.... The post is about detecting application stack overflow and underflow and, in particular, protecting and sizing the privileged stack in 32-bit and 64-bit modes.
I'd appreciate your thoughts, suggestions and corrections.
I'm looking at the base Intel and AMD 64-bit architecture (which I'll call x86-64 herein) with a view to it influencing my 32-bit code. Why? Well, it seems sensible to design 32-bit operations which don't require too many changes to port to 64-bit later. I've not looked at 64-bit working before. It is quite different, isn't it!
1) In x86-64 the stack segment has base = 0 and limit = none as do code and data segments. So it's not even an option to detect stack overflow (a request for stack expansion) or underflow (trying to remove more than the stack holds) by reference to the stack segment. The only option I can think of is to have guard page frames above and below every application (non-privileged) stack. These would be marked not-present. Is this the best way to detect application stack overflow and underflow?
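A minimal sketch of how a page-fault handler could tell the two cases apart from the faulting address (CR2 on x86). It assumes 4 KiB pages and one not-present guard page directly below and directly above each application stack. `stack_region` and `classify_stack_fault` are hypothetical names, not part of any existing API:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KiB pages */

/* Hypothetical description of one application stack: `lo` is the lowest
   usable address and `hi` is one past the highest. The page just below `lo`
   and the page starting at `hi` are guard pages marked not-present. */
struct stack_region {
    uint64_t lo;
    uint64_t hi;
};

enum fault_kind { FAULT_OTHER, FAULT_STACK_OVERFLOW, FAULT_STACK_UNDERFLOW };

/* Classify a page-fault address against the stack's guard pages. The stack
   grows down, so touching the lower guard page means overflow and touching
   the upper guard page means underflow. */
static enum fault_kind classify_stack_fault(const struct stack_region *s,
                                            uint64_t fault_addr)
{
    assert(s->lo >= PAGE_BYTES && s->lo < s->hi);
    if (fault_addr >= s->lo - PAGE_BYTES && fault_addr < s->lo)
        return FAULT_STACK_OVERFLOW;
    if (fault_addr >= s->hi && fault_addr < s->hi + PAGE_BYTES)
        return FAULT_STACK_UNDERFLOW;
    return FAULT_OTHER;  /* not a guard-page hit; handle as a normal fault */
}
```

The handler would look up the current thread's `stack_region`, call this with the faulting address, and terminate the application for either stack case.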
2) The privileged stack is a critical resource, isn't it? AFAICS it must always have present memory to write to. If, in a page fault, there is not enough stack space we'll get a double fault. And because double faults are not restartable there is no apparent means of recovery. So how is it best to provide privileged stack space? Should its size be checked at the top or bottom of some or all service routines, or can all service routines be written to unwind it before returning to user mode? It seems so but it would be good to hear what you guys have done or are thinking of.
3) If the privileged stack must always be large enough how much space should be set aside? If it is only used to service interrupts and syscalls it probably doesn't need to be very big. A 4k page seems much too large. The bulk of the state can be saved in a thread image if desirable.
> I normally list the key points of a post in the subject heading but in this case there are just too many.... The post is about detecting application stack overflow and underflow and, in particular, protecting and sizing the privileged stack in 32-bit and 64-bit modes.
> I'd appreciate your thoughts, suggestions and corrections.
> I'm looking at the base Intel and AMD 64-bit architecture (which I'll call x86-64 herein) with a view to it influencing my 32-bit code. Why? Well, it seems sensible to design 32-bit operations which don't require too many changes to port to 64-bit later.
Sensible. I was looking at interpreters to solve my "Just how do I get this stuff to work on 64-bit if my compilers are only 32-bit?" problem.
> I've not looked at 64-bit working before.
I haven't looked. Sorry, but BGB/cr88192 seems to be the only one discussing x86-64 stuff lately... And it seems to me that posting levels have fallen dramatically in the past few years. Some of that is probably due to major ISPs dropping free Usenet.
> It is quite different, isn't it!
I wouldn't know. Ok, I know a bit now...
> 1) In x86-64 the stack segment has base = 0 and limit = none as do code and data segments.
Uh... You'll have to explain that for me. Delete. Delete. OK, never mind. I had to read a bit of the manual. Yes, it seems that 64-bit mode ignores the segment base on SS, DS and ES, and ignores the limit on all.
> So it's not even an option to detect stack overflow (a request for stack expansion) or underflow (trying to remove more than the stack holds) by reference to the stack segment.
"The preferred method of implementing memory protection in a long-mode operating system is to rely on the page-protection mechanism..." - Sect. 4.9 Vol 2 AMD64 Arch. Progr. Man. 2007
Answer?
> The only option I can think of is to have guard page frames above and below every application (non-privileged) stack. These would be marked not-present. Is this the best way to detect application stack overflow and underflow?
Don't know. OK, I read some: it seems there is no limit checking in 64-bit mode for SS. There is, apparently, "canonical" addressing: the upper 16 bits must be a sign-extension of bit 47, and a stack reference that breaks this generates #SS. However, that still allows a large 48-bit address, "up high" or "down low" when sign-extended to 64 bits, but only if the address is mapped into the page tables... But if it's mapped into the tables, then you have access... Yes? No? Sigh, do I now need to know how privilege and rings work in 64-bit mode to answer that question?
> 2) The privileged stack is a critical resource, isn't it? AFAICS it must always have present memory to write to. If, in a page fault, there is not enough stack space we'll get a double fault. And because double faults are not restartable there is no apparent means of recovery. So how is it best to provide privileged stack space? Should its size be checked at the top or bottom of some or all service routines, or can all service routines be written to unwind it before returning to user mode? It seems so but it would be good to hear what you guys have done or are thinking of.
I actually haven't dealt with this issue at all. My (stalled) OS is 32-bit. It currently starts from DOS using a special TSR and inherits its stack from the DPMI host... This ensures it's not located where the application is! Of course, that will have to be fixed once it's loadable via a bootloader.
> 3) If the privileged stack must always be large enough how much space should be set aside? If it is only used to service interrupts and syscalls it probably doesn't need to be very big. A 4k page seems much too large. The bulk of the state can be saved in a thread image if desirable.
No idea. I've run into a few similar issues in my OS, and other C programs for that matter... I.e., "How much do I need to do this?..." and "Can I do this safely without knowing how much is needed?..." etc. And, I've not come up with any good answer other than "tweak it 'til it works..." and "take the safest path..."
James Harris wrote:
> I normally list the key points of a post in the subject heading but in this case there are just too many.... The post is about detecting application stack overflow and underflow and, in particular, protecting and sizing the privileged stack in 32-bit and 64-bit modes.
> I'd appreciate your thoughts, suggestions and corrections.
> I'm looking at the base Intel and AMD 64-bit architecture (which I'll call x86-64 herein) with a view to it influencing my 32-bit code. Why? Well, it seems sensible to design 32-bit operations which don't require too many changes to port to 64-bit later. I've not looked at 64-bit working before. It is quite different, isn't it!
Yes indeed, it's also physically a change to another CPU type.
> 1) In x86-64 the stack segment has base = 0 and limit = none as do code and data segments. So it's not even an option to detect stack overflow (a request for stack expansion) or underflow (trying to remove more than the stack holds) by reference to the stack segment. The only option I can think of is to have guard page frames above and below every application (non-privileged) stack. These would be marked not-present. Is this the best way to detect application stack overflow and underflow?
I think guard pages may do the job. OTOH it should be the job of the compiler/programmer to never let a stack bug happen :) Anyway, a system should detect and terminate such applications.
I once played around with a stack warning (still implemented in my debugger) given by a coarse check during the 1 ms PIT IRQ, but this wouldn't help much with running buggy applications.
> 2) The privileged stack is a critical resource, isn't it? AFAICS it must always have present memory to write to. If, in a page fault, there is not enough stack space we'll get a double fault. And because double faults are not restartable there is no apparent means of recovery. So how is it best to provide privileged stack space? Should its size be checked at the top or bottom of some or all service routines, or can all service routines be written to unwind it before returning to user mode? It seems so but it would be good to hear what you guys have done or are thinking of.
I assume you mean the system stack here, and of course it must be large enough for all system-internal calls and hardware IRQ handling (my IRQ user-event handlers live separately, in user space).
Where to put it? I have it on top of the resident system area, the part which will never be swapped out.
> 3) If the privileged stack must always be large enough how much space should be set aside? If it is only used to service interrupts and syscalls it probably doesn't need to be very big. A 4k page seems much too large. The bulk of the state can be saved in a thread image if desirable.
The required stack space depends... If your OS isn't as bloated as the two big ones, then 4 KB may be quite enough for 32-bit mode, and page-aligning it can help with safety.
Although I stand by my 'one stack per CPU is enough', I prepared (for 32-bit mode) 2 KB for four fixed system stack parts:
*512 IRQs + internal calls
+128 never used but reserved for the above
*128 exceptions ;needed only in case of a stack fault
*128 debugger ;just in case I debug live system code
*128 intermode linker ;while the RM user stack resides in the first MB or HMA
For 64-bit modes, which I just started to design, it may need more stack space just due to the larger element size.
__
wolfgang
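To put rough numbers on the element-size point, here is a small sketch comparing the stack used by one nested interrupt level in each mode. It assumes the interrupt is taken while already in the kernel (from user mode, 32-bit mode also pushes SS and ESP) and that the handler saves every general-purpose register on the stack; the macro and function names are illustrative only:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes the CPU itself pushes when an interrupt arrives (no error code). */
#define HW_FRAME32_SAME_PL 12u  /* 32-bit, already at PL0: EFLAGS, CS, EIP */
#define HW_FRAME64         40u  /* 64-bit: SS, RSP, RFLAGS, CS, RIP, always pushed */

/* Bytes a handler spends if it saves the general-purpose registers itself. */
#define GPR_SAVE32 (8u * 4u)    /* PUSHAD: 8 x 32-bit registers */
#define GPR_SAVE64 (15u * 8u)   /* 15 x 64-bit registers (all but RSP); PUSHA is invalid in 64-bit mode */

/* Stack bytes consumed by one nested interrupt level. Error codes, locals
   and the 16-byte RSP alignment done in 64-bit mode are ignored. */
static size_t nested_irq_bytes32(void) { return HW_FRAME32_SAME_PL + GPR_SAVE32; }
static size_t nested_irq_bytes64(void) { return HW_FRAME64 + GPR_SAVE64; }

/* Total for `depth` nested levels. */
static size_t nested_irq_total(size_t per_level, unsigned depth)
{
    return per_level * depth;
}
```

Under these assumptions one level costs 44 bytes in 32-bit mode and 160 bytes in 64-bit mode, so four nested levels come to 176 versus 640 bytes. Moving the register save area off the stack removes most of the 64-bit growth.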
James Harris wrote...
> 3) If the privileged stack must always be large enough how much space should be set aside? If it is only used to service interrupts and syscalls it probably doesn't need to be very big. A 4k page seems much too large. The bulk of the state can be saved in a thread image if desirable.
I think I use 4k (or maybe 8k) stacks per task in the kernel; these stacks are used during syscalls and for kernel tasks. I also have a single, separate stack for handling hardware interrupts. I allow nested hardware interrupts and switch to the interrupt stack in the assembly outer wrapper of the interrupt handler. No switch to the interrupt stack occurs if it is already on it. A counter keeps track of the nesting level; once it returns to zero it switches back to the task's normal kernel stack.
I think I set the interrupt stack to be about 16k; I doubt nested interrupt handlers would use anywhere near that. If I get round to multiprocessing there would be an interrupt stack per CPU/core.
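The nesting-counter scheme just described can be modelled in C as below. This is only a sketch of the bookkeeping: in a real kernel it lives in the assembly wrapper and runs with interrupts disabled while the counter and stack pointer change. `irq_cpu`, `irq_enter` and `irq_exit` are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU state: one shared interrupt stack, switched to only
   on the outermost interrupt. */
struct irq_cpu {
    uintptr_t irq_stack_top;   /* initial SP of the interrupt stack */
    uintptr_t saved_task_sp;   /* task's kernel SP at outermost entry */
    unsigned  nest;            /* current interrupt nesting level */
};

/* Called on entry with the current SP; returns the SP the handler runs on. */
static uintptr_t irq_enter(struct irq_cpu *c, uintptr_t sp)
{
    if (c->nest++ == 0) {
        c->saved_task_sp = sp;     /* remember the task's kernel stack */
        return c->irq_stack_top;   /* outermost level: switch stacks */
    }
    return sp;                     /* nested: already on the interrupt stack */
}

/* Called on exit with the current SP; returns the SP to resume on. */
static uintptr_t irq_exit(struct irq_cpu *c, uintptr_t sp)
{
    assert(c->nest > 0);
    if (--c->nest == 0)
        return c->saved_task_sp;   /* back to the task's normal kernel stack */
    return sp;
}
```

With one `irq_cpu` per CPU/core, the same code covers the multiprocessor case.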
> > I normally list the key points of a post in the subject heading but in this case there are just too many.... The post is about detecting application stack overflow and underflow and, in particular, protecting and sizing the privileged stack in 32-bit and 64-bit modes.
> > I'd appreciate your thoughts, suggestions and corrections.
> > I'm looking at the base Intel and AMD 64-bit architecture (which I'll call x86-64 herein) with a view to it influencing my 32-bit code. Why? Well, it seems sensible to design 32-bit operations which don't require too many changes to port to 64-bit later.
> Sensible. I was looking at interpreters to solve my "Just how do I get this stuff to work on 64-bit if my compilers are only 32-bit?" problem.
Do you mean how do you run 32-bit compilers under a 64-bit OS? I would have thought they should still work as ordinary 32-bit apps unless they do something very odd. The bigger issue is dealing with their system calls.
> I haven't looked. Sorry, but BGB/cr88192 seems to be the only one discussing x86-64 stuff lately... And it seems to me that posting levels have fallen dramatically in the past few years. Some of that is probably due to major ISPs dropping free Usenet.
> > It is quite different, isn't it!
> I wouldn't know. Ok, I know a bit now...
> > 1) In x86-64 the stack segment has base = 0 and limit = none as do code and data segments.
> Uh... You'll have to explain that for me. Delete. Delete. OK, never mind. I had to read a bit of the manual. Yes, it seems that 64-bit mode ignores the segment base on SS, DS and ES, and ignores the limit on all.
Yes, and CS. Only FS and GS are allowed to roam free. Notice that the stack segment cannot be expand-down. Hence my plan for empty (not-present) page frames to bound each stack.
> > So it's not even an option to detect stack overflow (a request for stack expansion) or underflow (trying to remove more than the stack holds) by reference to the stack segment.
> "The preferred method of implementing memory protection in a long-mode operating system is to rely on the page-protection mechanism..." - Sect. 4.9 Vol 2 AMD64 Arch. Progr. Man. 2007
> Answer?
Sort of. It seems page protection is the *only* way to implement memory protection.
As an aside, perhaps this is a case where competition has not been good for the consumer. Intel's design of the 80386 was brilliant. So good, in fact, that it has stood the test of time. Speeds have increased and some new instructions have been added, but we still use the same user-facing architecture (with binary compatibility) almost 25 years later. And while there are 64-bit options now there's no sign that the 80386 32-bit architecture is dying out.
By contrast, the 64-bit architecture defined by AMD may have been first to market with x86-32 compatibility, but although the registers are wider and there are more of them (a good thing, but not rocket science), it doesn't seem to do anything special.
For example, rather than fixing the segments model (one of the few things Intel's 32-bit model didn't do well) or scaling it down they did away with it (almost) altogether. In particular, in terms of memory protection, forcing the overlap of code and data seems a 'bad thing.'
> > The only option I can think of is to have guard page frames above and below every application (non-privileged) stack. These would be marked not-present. Is this the best way to detect application stack overflow and underflow?
> Don't know. OK, I read some: it seems there is no limit checking in 64-bit mode for SS. There is, apparently, "canonical" addressing: the upper 16 bits must be a sign-extension of bit 47, and a stack reference that breaks this generates #SS. However, that still allows a large 48-bit address, "up high" or "down low" when sign-extended to 64 bits, but only if the address is mapped into the page tables... But if it's mapped into the tables, then you have access... Yes? No? Sigh, do I now need to know how privilege and rings work in 64-bit mode to answer that question?
Yes, canonical addressing is good. It ensures that unused upper bits are not places where the programmer can squirrel away extra data, and then find the code doesn't work when implementations use more bits for addressing.
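The canonical-address rule can be stated as a small check. With 48 implemented virtual-address bits, bits 63 through 47 must all be equal, i.e. all zeros (the "down low" half) or all ones (the "up high" half). A non-canonical reference faults (#SS for stack references, #GP otherwise) instead of being translated. The function name and its `va_bits` parameter are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if `va` is canonical for an implementation with `va_bits` virtual
   address bits: bits 63..(va_bits-1) must be all zeros or all ones. */
static bool is_canonical(uint64_t va, unsigned va_bits)
{
    assert(va_bits >= 1 && va_bits < 64);
    uint64_t top      = va >> (va_bits - 1);          /* bits 63..va_bits-1 */
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1);  /* that many 1 bits */
    return top == 0 || top == all_ones;
}
```

For example, `is_canonical(0x0000800000000000, 48)` is false: bit 47 is set but bits 63..48 are clear, so the value is not a sign-extension. Software that hides tag bits in pointers would trip over exactly this.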
I think privilege rings work the same way in 64-bit mode. IIRC paging must be enabled before changing to 64-bit mode.
On 11 Oct, 11:28, "Wolfgang Kern" <nowh...@never.at> wrote:
> James Harris wrote:
...
> > I've not looked at 64-bit working before. It is quite different, isn't it!
> Yes indeed, it's also physically a change to another CPU type.
It's an odd mix. I'm not sure why they didn't go for different instruction encodings since binary code will not be compatible between 32-bit and 64-bit modes. I doubt much of the decode unit has to be shared. A more efficient encoding would have removed the need for a REX prefix, for example, to access the upper eight registers.
> > 1) In x86-64 the stack segment has base = 0 and limit = none as do code and data segments. So it's not even an option to detect stack overflow (a request for stack expansion) or underflow (trying to remove more than the stack holds) by reference to the stack segment. The only option I can think of is to have guard page frames above and below every application (non-privileged) stack. These would be marked not-present. Is this the best way to detect application stack overflow and underflow?
> I think guard pages may do the job. OTOH it should be the job of the compiler/programmer to never let a stack bug happen :) Anyway, a system should detect and terminate such applications.
Apart from runaway stack use the OS may want to allocate pages to the stack as they are used. For example it might allocate space for a stack of ten pages but only commit one of them. The page fault handler would then distinguish between an extra page needed in the permitted range and an attempt to allocate over the permitted range.
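A sketch of the page-fault decision described above, assuming 4 KiB pages, a page-aligned stack top, and a guard page immediately below the reserved range. `lazy_stack` and `stack_fault` are hypothetical; a real handler would also allocate and map the newly committed pages:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KiB pages */

/* Hypothetical per-thread stack record. The stack may use the `reserved`
   pages just below `top`; only the top `committed` are mapped so far. */
struct lazy_stack {
    uint64_t top;        /* one past the highest stack address (page-aligned) */
    unsigned reserved;   /* pages the stack is permitted to grow to */
    unsigned committed;  /* pages currently mapped, counted down from top */
};

enum stack_fault_action { STACK_NOT_OURS, STACK_GROW, STACK_OVERFLOW };

/* Decide what the page-fault handler should do with a fault at `addr`. */
static enum stack_fault_action stack_fault(struct lazy_stack *s, uint64_t addr)
{
    assert(s->committed <= s->reserved);
    uint64_t limit     = s->top - (uint64_t)s->reserved * PAGE_BYTES;  /* lowest permitted */
    uint64_t mapped_lo = s->top - (uint64_t)s->committed * PAGE_BYTES; /* lowest mapped */

    if (addr >= mapped_lo)
        return STACK_NOT_OURS;   /* already mapped or above the stack: other fault */
    if (addr >= limit) {
        /* Inside the permitted range: commit every page from the faulting
           one up to the mapped region (a real handler maps them here). */
        uint64_t page = addr & ~(uint64_t)(PAGE_BYTES - 1);
        s->committed = (unsigned)((s->top - page) / PAGE_BYTES);
        return STACK_GROW;
    }
    if (addr >= limit - PAGE_BYTES)
        return STACK_OVERFLOW;   /* hit the guard page below the permitted range */
    return STACK_NOT_OURS;
}
```

With a ten-page reservation and one committed page, a push that touches the second page commits it and resumes the thread, while a touch just below the tenth page is reported as overflow.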
That points out a weakness resulting from the loss of the stack segment in AMD64. Say a routine allocates a large stack frame of 10k. (Unusual but certainly possible.) After it decrements rsp by 10k, even though there is an unmapped guard page (at around the new rsp + 9k), the routine might start trampling over memory below that guard page (at rsp + 0 and above) before it tries to access the guard page at something like rsp + 8k. Hence it's writing over memory it shouldn't touch.
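The usual software answer to this problem is stack probing: a function with a large frame touches its new stack one page at a time, top down, before using it. This is the idea behind compiler features such as GCC's -fstack-clash-protection and MSVC's __chkstk. The sketch below only computes the probe addresses. It assumes 4 KiB pages and that the page holding the old stack pointer is already mapped. `stack_probes` is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KiB pages */

/* Compute the addresses to touch, in order, when lowering the stack pointer
   from `old_sp` by `frame` bytes: one probe per page walking downwards, then
   the new stack top. Each probe is at most one page below the previous one,
   so the first unmapped page reached is the guard page, not memory beyond
   it. Returns the number of probes stored (at most `max`). */
static size_t stack_probes(uint64_t old_sp, uint64_t frame,
                           uint64_t *out, size_t max)
{
    assert(frame <= old_sp);
    uint64_t new_sp = old_sp - frame;
    size_t n = 0;
    for (uint64_t off = PAGE_BYTES; off < frame && n < max; off += PAGE_BYTES)
        out[n++] = old_sp - off;
    if (n < max)
        out[n++] = new_sp;
    return n;
}
```

For the 10k frame in the example, starting at rsp = 0x10000, the probes are 0xF000, 0xE000 and finally 0xD800. If only the page at 0xF000 is mapped, the second probe faults on the guard page before anything below it is written.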
> I once played around with a stack warning (still implemented in my debugger) given by a coarse check during the 1 ms PIT IRQ, but this wouldn't help much with running buggy applications.
> > 2) The privileged stack is a critical resource, isn't it? AFAICS it must always have present memory to write to. If, in a page fault, there is not enough stack space we'll get a double fault. And because double faults are not restartable there is no apparent means of recovery. So how is it best to provide privileged stack space? Should its size be checked at the top or bottom of some or all service routines, or can all service routines be written to unwind it before returning to user mode? It seems so but it would be good to hear what you guys have done or are thinking of.
> I assume you mean the system stack here, and of course it must be large enough for all system-internal calls and hardware IRQ handling (my IRQ user-event handlers live separately, in user space).
> Where to put it? I have it on top of the resident system area, the part which will never be swapped out.
> > 3) If the privileged stack must always be large enough how much space should be set aside? If it is only used to service interrupts and syscalls it probably doesn't need to be very big. A 4k page seems much too large. The bulk of the state can be saved in a thread image if desirable.
> The required stack space depends... If your OS isn't as bloated as the two big ones, then 4 KB may be quite enough for 32-bit mode, and page-aligning it can help with safety.
> Although I stand by my 'one stack per CPU is enough', I prepared (for 32-bit mode) 2 KB for four fixed system stack parts:
> *512 IRQs + internal calls
> +128 never used but reserved for the above
> *128 exceptions ;needed only in case of a stack fault
> *128 debugger ;just in case I debug live system code
> *128 intermode linker ;while the RM user stack resides in the first MB or HMA
> For 64-bit modes, which I just started to design, it may need more stack space just due to the larger element size.
True. One option, rather than saving state on the stack, is to save it in a thread-local scratchpad or, if only one interrupt can be in service at a time, even a CPU-local scratchpad block. I don't think the stack proper needs to hold much other than return addresses and a few pointers.
> James Harris wrote...
> > 3) If the privileged stack must always be large enough how much space should be set aside? If it is only used to service interrupts and syscalls it probably doesn't need to be very big. A 4k page seems much too large. The bulk of the state can be saved in a thread image if desirable.
> I think I use 4k (or maybe 8k) stacks per task in the kernel; these stacks are used during syscalls and for kernel tasks. I also have a single, separate stack for handling hardware interrupts.
> I allow nested hardware interrupts and switch to the interrupt stack in the assembly outer wrapper of the interrupt handler. No switch to the interrupt stack occurs if it is already on it. A counter keeps track of the nesting level; once it returns to zero it switches back to the task's normal kernel stack.
> I think I set the interrupt stack to be about 16k; I doubt nested interrupt handlers would use anywhere near that. If I get round to multiprocessing there would be an interrupt stack per CPU/core.
Thanks for the info.
Just out of curiosity, why not use just one PL=0 stack for when in privileged mode? Something to do with task switching?
> > > I'm looking at the base Intel and AMD 64-bit architecture (which I'll call x86-64 herein) with a view to it influencing my 32-bit code. Why? Well, it seems sensible to design 32-bit operations which don't require too many changes to port to 64-bit later.
> > I was looking at interpreters to solve my "Just how do I get this stuff to work on 64-bit if my compilers are only 32-bit?" problem.
> Do you mean how do you run 32-bit compilers under a 64-bit OS?
Sorry, I should've been a bit clearer. If the C compilers I currently use are only 32-bit, how do I compile my C code for 64-bit? (Can't...) E.g., "Just how do I get this stuff to work..." So, either I get new compilers or I find another method of "code portability", like an interpreter.
> > > So it's not even an option to detect stack overflow (a request for stack expansion) or underflow (trying to remove more than the stack holds) by reference to the stack segment.
> > "The preferred method of implementing memory protection in a long-mode operating system is to rely on the page-protection mechanism..." - Sect. 4.9 Vol 2 AMD64 Arch. Progr. Man. 2007
> > Answer?
> Sort of. It seems page protection is the *only* way to implement memory protection.
... for 64-bit segments.
> As an aside, perhaps this is a case where competition has not been good [...]
If I drop the "consumer" part of that phrase, I'd definitely agree. I personally would've preferred that they kept the execution model very close to what already existed. Changing the execution model, while perhaps a good choice for the future, creates problems for programmers.
> [AA-64's design is a situation ...] where competition has not been good for the consumer.
Not sure if it matters at all to the generic or even adept consumer. It does matter to assembly OS programmers. It shouldn't matter to HLL programmers or to C OS programmers who don't see much assembly.
> Intel's design of the 80386 was brilliant.
I have the same issue here (or with the '286 actually) as with AA-64: I personally would've preferred that they kept the execution model very close to what already existed, i.e., a 24-bit RM, then a 32-bit RM, instead of adopting a new PM model.
> Intel's design of the 80386 [...] stood the test of time.
Wasn't it based on established *NIX design?
> In particular, in terms of memory protection, forcing the overlap of code and data seems a 'bad thing.'
NX or XD bits? While I'm not familiar with their use either, my understanding was that these were implemented (supposedly at MS' request) to handle this issue.
> > [...] there is, apparently, "canonical" addressing: the upper 16 bits must be a sign-extension of bit 47, and a stack reference that breaks this generates #SS. However, that still allows a large 48-bit address, "up high" or "down low" when sign-extended to 64 bits, but only if the address is mapped into the page tables...
> Yes, canonical addressing is good. It ensures that unused upper bits are not places where the programmer can squirrel away extra data, and then find the code doesn't work when implementations use more bits for addressing.
That was a problem with, um..., 24-bit to 32-bit addressing on the Motorola 68k series, IIRC. Was it from 24-bit '286 to 32-bit '386 also?
> IIRC paging must be enabled before changing to 64-bit mode.
That's an important difference although probably not for you. But, if I were to "design 32-bit operations which don't require too many changes to port to 64-bit later" for my OS, I'd have to enable paging for 32-bits.
> > > > I'm looking at the base Intel and AMD 64-bit architecture (which I'll call x86-64 herein) with a view to it influencing my 32-bit code. Why? Well, it seems sensible to design 32-bit operations which don't require too many changes to port to 64-bit later.
> > > I was looking at interpreters to solve my "Just how do I get this stuff to work on 64-bit if my compilers are only 32-bit?" problem.
> > Do you mean how do you run 32-bit compilers under a 64-bit OS?
> Sorry, I should've been a bit clearer. If the C compilers I currently use are only 32-bit, how do I compile my C code for 64-bit? (Can't...) E.g., "Just how do I get this stuff to work..." So, either I get new compilers or I find another method of "code portability", like an interpreter.
You know you can run 32-bit apps "unchanged" on x86-64 in compatibility mode. If referring to non-apps - i.e. system code - for the most part it seems to me to need a rewrite. Code for a 32-bit kernel and code for a 64-bit kernel, at this point, seem to me to have too many differences to use much of the same code, though they can work with similar structures and concepts.
> > > > So it's not even an option to detect stack overflow (a request for stack expansion) or underflow (trying to remove more than the stack holds) by reference to the stack segment.
> > > "The preferred method of implementing memory protection in a long-mode operating system is to rely on the page-protection mechanism..." - Sect. 4.9 Vol 2 AMD64 Arch. Progr. Man. 2007
> > > Answer?
> > Sort of. It seems page protection is the *only* way to implement memory protection.
> ... for 64-bit segments.
Yes
> > As an aside, perhaps this is a case where competition has not been good [...]
> If I drop the "consumer" part of that phrase, I'd definitely agree. I personally would've preferred that they kept the execution model very close to what already existed. Changing the execution model, while perhaps a good choice for the future, creates problems for programmers.
> > [AA-64's design is a situation ...] where competition has not been good for the consumer.
By consumers I mean programmers generally. The average Windows user won't see the differences.
> Not sure if it matters at all to the generic or even adept consumer. It does matter to assembly OS programmers. It shouldn't matter to HLL programmers or to C OS programmers who don't see much assembly.
You think even a C OS programmer won't see much difference? You may be right. Maybe I've been focussing on the differences too much!
> > Intel's design of the 80386 was brilliant.
> I have the same issue here (or with the '286 actually) as with AA-64: I personally would've preferred that they kept the execution model very close to what already existed, i.e., a 24-bit RM, then a 32-bit RM, instead of adopting a new PM model.
> > Intel's design of the 80386 [...] stood the test of time.
> Wasn't it based on established *NIX design?
I don't know. It was derived from the 286 for sure.
An odd thought: the 8086 had a genuinely bizarre use of segments. The real mystery is where that came from. And, no, I haven't looked it up. If I did I'd probably find out a good reason.... Anyway the segment registers actually made sense when they switched to protected mode. I know they weren't fast to load and at 16 bits were awkward to pass around and other things but they at least made sense in protected mode. It's as if Intel knew they would need them in the future so they added them with a minuscule x16 offset years earlier. But I digress.
> > In particular, in terms of memory protection, forcing the overlap of code and data seems a 'bad thing.'
> NX or XD bits? While I'm not familiar with their use either, my understanding was that these were implemented (supposedly at MS' request) to handle this issue.
A good case in point. These had to be retrofitted because the code segment was abandoned (or at least munged with the data segments) and people then started to execute data as code. This security hole would never have arisen if OSes had kept data and code separate in the first place. The no-execute bit in the page table is a fix for a problem that didn't need to exist.
Of course, now it's touted as a big selling point: this processor supports execute disable. One bit in a paging structure (which wouldn't be needed if the OSes used the hardware protections already provided) has become a celebrity.
Speaking of which I've been looking to see what processors support NX or XD but all I can find is it depends on what CPUID says. Anyone know of a list of processors which provide this support? It affects what I need to write for those that don't.
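Short of a list, the run-time test is small. CPUID extended leaf 0x80000001 reports the feature in EDX bit 20 (AMD calls it NX, Intel XD). A minimal sketch, assuming a GCC or Clang toolchain for `<cpuid.h>` and its `__get_cpuid` helper; a kernel would more likely issue CPUID through its own inline assembly. The decode is kept separate so it can be tested without the hardware. Note that the bit only takes effect with PAE or long-mode page tables (it is bit 63 of an entry) and with EFER.NXE set. A 32-bit kernel using non-PAE paging gets no execute protection from paging even on CPUs that report it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

#define CPUID_EXT_FEATURES 0x80000001u
#define CPUID_EDX_NX       (1u << 20)   /* NX (AMD) / XD (Intel) */

/* Decode the NX/XD flag from the EDX value returned by leaf 0x80000001. */
static bool nx_in_edx(uint32_t edx)
{
    return (edx & CPUID_EDX_NX) != 0;
}

/* Query the running CPU. __get_cpuid returns 0 when the requested leaf is
   not supported, which is treated as "no NX". */
static bool cpu_has_nx(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(CPUID_EXT_FEATURES, &eax, &ebx, &ecx, &edx))
        return false;
    return nx_in_edx(edx);
#else
    return false;   /* not x86 or not GCC/Clang: cannot ask here */
#endif
}
```

Code for CPUs without the bit then has to fall back on whatever other protection the OS chooses, such as separate code segments in 32-bit mode.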
> > > [...] there is, apparently, "canonical" addressing: the upper 16 bits must be a sign-extension of bit 47, and a stack reference that breaks this generates #SS. However, that still allows a large 48-bit address, "up high" or "down low" when sign-extended to 64 bits, but only if the address is mapped into the page tables...
> > Yes, canonical addressing is good. It ensures that unused upper bits are not places where the programmer can squirrel away extra data, and then find the code doesn't work when implementations use more bits for addressing.
> That was a problem with, um..., 24-bit to 32-bit addressing on the Motorola 68k series, IIRC. Was it from 24-bit '286 to 32-bit '386 also?
I don't know. Thankfully I never had to plan for the 286.
> > IIRC paging must be enabled before changing to 64-bit mode.
> That's an important difference although probably not for you. But, if I were to "design 32-bit operations which don't require too many changes to port to 64-bit later" for my OS, I'd have to enable paging for 32-bits.
True. Paging is part of my plans anyway but IIRC not yours.
> On 16 Oct, 10:29, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
> > "James Harris" <james.harri...@googlemail.com> wrote in message
> > > [AA-64's design is a situation ...] where competition has not been good for the consumer.
> > Not sure if it matters at all to the generic or even adept consumer. It does matter to assembly OS programmers. It shouldn't matter to HLL programmers or to C OS programmers who don't see much assembly.
> You think even a C OS programmer won't see much difference? You may be right. Maybe I've been focussing on the differences too much!
Well, that's based on my current recollections of my OS development experiences... I've done everything in C that I could do without using assembly. My OS is by no means devoid of assembly. There is a bit. The assembly code is various 32-bit privileged instructions, interrupt wrappers, code for my startup method, code for the cpu mode setup, misc. assembly adjustments which I might've done the hard way, and code for a bunch of no longer necessary design choices, which are in assembly inlined in C. That stuff would need a 64-bit rewrite. The C code that deals with fixed size fields of CPU data structures, e.g., filling in descriptors or interrupt vectors, might need adjustments. But, most of the C code, the non-special areas, should be functionally the same when compiled for 64-bits. But, I don't have 64-bit compilers...
> This security hole would never have arisen if OSes had kept data and code separate in the first place.
So, you would dump the von Neumann architecture for a Harvard architecture?
I think the standardization of micro's on 8-bit bytes for ASCII and von Neumann really helped languages like C and FORTH. I know that C is more difficult to implement if memory sizes aren't 8-bit byte based, e.g, 16-bit word sized, or if integers and pointers are not equally sized, or if different pointer types exist, etc. I'd assume that the primary reason to use Harvard would be to support different instruction and data sizes. e.g., small RISC instruction set with large integer size. I've never heard of Harvard used to provide security, although I see no reason why it couldn't be. The question is: "Is data always tied to an instruction or are all instructions data free?" Typically, there is instruction data - data that is part of an instruction like offsets - and non-instruction data such as storing a register value in a memory location. How does Harvard keep the two separate, or how does it link the instruction data to an instruction?... Separating the two introduces potential complexities.
Personally, if I'm coding in C, I don't care about the issue of code and data separation. The compiler takes care of it for me. Of course, I'm coding for 32-bit C where some of these issues are resolved without segmentation present. I'm not coding in C for 8086 with all its different memory models... But, if I'm coding in assembly, I'm not usually interested in keeping the code and non-instruction data separate. It complicates development by moving data outside the accessible offset ranges of instructions and outside the locality of the code using the variable.
> One bit in a paging structure (which wouldn't be needed if the OSes used the hardware protections already provided) has become a celebrity.
OS developers apparently learned recently what assembly programmers knew twenty years ago. That a flat non-segmented address space is optimal. (somewhat seriously, somewhat sarcastically, somewhat humorously...)
IMO, using XD and NX to prevent buffer overflow attacks in C stackframes is clearly the wrong solution to the problem. The correct solution is two stacks for C. One for control flow information and the other for data. Then, data cannot overwrite control flow information.
> Speaking of which I've been looking to see what processors support NX or XD but all I can find is it depends on what CPUID says. Anyone know of a list of processors which provide this support? It affects what I need to write for those that don't.
> Paging is part of my plans anyway but IIRC not yours.
It has definite advantages. Reorganizing the address space any way you see fit is beneficial. But, I don't need it yet. My OS is just not developed enough. And, paging introduces the potential of page faults which, in my mind, is a reliability issue I'll have to figure out how to fix. So, until I'm more familiar with paging and can look into minimizing, preferably eliminating, page faults, I won't be doing much with paging anytime soon.
On 16 Oct, 14:20, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote: ...
> > This security hole would never have arisen if OSes had kept data and code separate in the first place.
> So, you would dump the von Neumann architecture for a Harvard architecture?
My thoughts are more along the lines of using information available. I would fundamentally distinguish between
1. Machine code
2. Read-only data
3. Read-write data
To me these are different parts of an executable's image and imply different treatments. Such distinctions can lead to gains in security and efficiency. To wit, the NX bit and also the separate instruction and data caches on CPUs. These caches provide a Harvard-like division but there's not necessarily a need to go the whole hog and use different buses etc.
> I think the standardization of micros on 8-bit bytes for ASCII and von Neumann really helped languages like C and FORTH. I know that C is more difficult to implement if memory sizes aren't 8-bit byte based, e.g., 16-bit word sized, or if integers and pointers are not equally sized, or if different pointer types exist, etc. I'd assume that the primary reason to use Harvard would be to support different instruction and data sizes, e.g., a small RISC instruction set with a large integer size. I've never heard of Harvard used to provide security, although I see no reason why it couldn't be. The question is: "Is data always tied to an instruction or are all instructions data free?" Typically, there is instruction data - data that is part of an instruction like offsets - and non-instruction data such as storing a register value in a memory location. How does Harvard keep the two separate, or how does it link the instruction data to an instruction?...
I don't know. I've never really had a reason to look at Harvard architectures but have no problem with an instruction having an immediate operand. Otherwise, accessing code as an offset from a base code address and data as an offset from a base data address seem fine to me.
> Separating the two introduces potential complexities.
> Personally, if I'm coding in C, I don't care about the issue of code and data separation. The compiler takes care of it for me. Of course, I'm coding for 32-bit C where some of these issues are resolved without segmentation present. I'm not coding in C for 8086 with all its different memory models... But, if I'm coding in assembly, I'm not usually interested in keeping the code and non-instruction data separate. It complicates development by moving data outside the accessible offset ranges of instructions and outside the locality of the code using the variable.
The different memory models (tiny, small, large, huge, etc., or whatever they were called) were always a bad idea, IMHO. Why? Well, as programmers we want to express algorithms and solve computational problems. Much of the mechanism used for these models seems to me to be a different level of abstraction.
As for your comment about accessing data within reasonable offsets of instructions, well, it's not a model I care much for. Given that, you can guess that AMD found another way to annoy me with the rip-relative addressing they added in x86-64. It helps to enshrine the old fashioned load image model where the data sits at certain offsets from the code. I think we should be leaving that old model behind, not encouraging it.
Sorry if this sounds like a diatribe. It's not meant to be: more a (very brief) explanation. In fairness we can always program round these things and it's that which has been occupying my mind for the past while as I've been looking at AMD's x86-64.
One positive thing I found in their design is the swapgs instruction. It allows a called routine to *quickly* find its working data - but only if that routine is the PL 0 kernel and only if it has been called from PL 3.
> > One bit in a paging structure (which wouldn't be needed if the OSes used the hardware protections already provided) has become a celebrity.
> OS developers apparently learned recently what assembly programmers knew twenty years ago. That a flat non-segmented address space is optimal. (somewhat seriously, somewhat sarcastically, somewhat humorously...)
A flat model is good in some ways but it has limitations. Say we want to expand the size of a region of memory. Think of realloc in C. If the memory into which the region would expand is occupied we have to reallocate elsewhere (if memory or address space is available), copy data, repoint and then remove the old mapping. And even then, any pointers into the old region will be incorrect. A two-dimensional view of memory would be better here and make this stuff much easier and faster.
I think there may be ways to ameliorate these problems somewhat ... but there's no gain without pain elsewhere.
> IMO, using XD and NX to prevent buffer overflow attacks in C stackframes is clearly the wrong solution to the problem. The correct solution is two stacks for C. One for control flow information and the other for data. Then, data cannot overwrite control flow information.
An interesting idea. Of course, in C any pointer that's, er, mispointed can overwrite control flow or any other info but that is because of current models. Developing your suggestion the return addresses stack could be protected against update by instructions and the data stack could perhaps also be protected against accesses which are not off the stack pointer or frame pointer.
Of course, setting up stack frames is not mandated by C is it? You could always store arguments and parameters elsewhere in memory and pass a pointer to them. Then the one and only stack would just hold return addresses. (It would maybe also hold the parameter block pointer too if you didn't want to pass it in a register.)
> > Speaking of which I've been looking to see what processors support NX or XD but all I can find is it depends on what CPUID says. Anyone know of a list of processors which provide this support? It affects what I need to write for those that don't.
Ah, I thought something else was needed but wasn't sure what. Looks like paging is the way to go, Rod. Go on, add it in. You know you want to really. :-)
> > Paging is part of my plans anyway but IIRC not yours.
> It has definite advantages. Reorganizing the address space any way you see fit is beneficial. But, I don't need it yet. My OS is just not developed enough. And, paging introduces the potential of page faults which, in my mind, is a reliability issue I'll have to figure out how to fix. So, until I'm more familiar with paging and can look into minimizing, preferably eliminating, page faults, I won't be doing much with paging anytime soon.
You may find it difficult to add paging with your current boot method. When you get back to OS dev you may want to look at a more conventional boot method. Then you have full control of a virgin machine with no cruft from someone else's operating system.
BTW, although page faults are called faults it doesn't imply faultiness or lack of reliability. :-) I know you didn't mean that but don't forget that you can start by identity mapping all memory - or at least all of it you intend to use. Then linear addresses will be equal to physical addresses. Mark all pages present, writable and with the appropriate privilege level and you won't have any page faults either. Once that's working you can delay really making use of paging until you are ready.
> Given that, you can guess that AMD found another way to annoy me with the rip-relative addressing they added in x86-64.
Doesn't x86-64 have two types of addressing?...
> One positive thing I found in their design is the swapgs instruction.
Hmm, I entered "swapgs" into Yahoo to find out what it does. The only things that come up are security vulnerabilities...
> It allows a called routine to *quickly* find its working data - but only if that routine is the PL 0 kernel and only if it has been called from PL 3.
...
> Developing your suggestion the return addresses stack could be protected against update by instructions and the data stack could perhaps also be protected against accesses which are not off the stack pointer or frame pointer.
Of course, without this being done in hardware by the memory manager, one could bypass it using assembly.
> Of course, setting up stack frames is not mandated by C is it?
No. But, C requires recursion. I don't know all the info on this, but apparently computer scientists (CS) proved in the 1950s or 1960s that a stack using stackframes was the easiest way to implement recursion.
> You could always store arguments and parameters elsewhere in memory and pass a pointer to them. Then the one and only stack would just hold return addresses. (It would maybe also hold the parameter block pointer too if you didn't want to pass it in a register.)
A stack in memory...
> Looks like paging is the way to go, Rod.
I'm not sure it's "the way". But, it's apparently the only choice remaining...
I remember GUI OSes (Amiga, Mac) working well on Motorola 68000 CPUs without hardware MMUs. I'm not sure if there were MMU features implemented in software or not. I think that would've been unlikely given the processing power, or lack thereof, they had at the time. The 68k series added an external MMU with the later 68020 processor.
> When you get back to OS dev you may want to look at a more conventional boot method
I've got two projects I'm more interested in at the moment.
> Then you have full control of a virgin machine with no cruft from someone else's operating system.
There's not much cruft from starting from DOS. There's far more cruft from my choice, originally, to use DOS C compiler libraries and executables... I'll need to remove that stuff from my OS. If I can ever get my C-ish compilers working, then I'll have control of a compiler and won't have to deal with pre-existing conditions and bugs of other compilers. Attempts at minimalism are in the works here...
The cruft from starting from DOS is basically just the same stuff I'd have to do from a bootloader. Ignoring my special DOS TSR startup method, I'm in 16-bit RM when I start - just like from a bootloader. And, I start compiled 32-bit C code from 16-bit RM. One needs to set up the standard 16-bit RM to 32-bit PM switch as well as an instruction pointer, stack pointer, and old stack pointer.
I do the basic CPU startup in 16/32-bit assembly, but I also rerun CPU setup using inlined 32-bit C. That sounds redundant, but MultiBoot, e.g. GRUB, passes control in raw 32-bit mode... The assembly is a minimal 32-bit setup, while the C code is more thorough, IIRC. If I do my own bootloader, I can move the assembly there. The OpenWatcom C code is low or no cruft. The DJGPP C code uses and accesses a few things in its C libraries and C startup that I have to recreate. When I get around to eliminating all C library code, some of that will be eliminated. The bootloader should eliminate the rest. Usually, people write their own kernel library to make sure the C functions are safe. I chose to use existing C libraries to speed up development. But, I also made sure I'm only using OS safe functions. I just realized while writing this that I'm not really sure why I did that. I don't use the C libraries much, if at all, for most of my other programs. cliche: "Hindsight..."
> BTW, although page faults are called faults it doesn't imply faultiness or lack of reliability.
Well, I recall seeing a chart in one of the manuals and didn't like some of the situations which generated them.
> [...] you can start by identity mapping all memory - or at least all of it you intend to use. Then linear addresses will be equal to physical addresses. Mark all pages present, writable and with the appropriate privilege level and you won't have any page faults either. Once that's working you can delay really making use of paging until you are ready.
> > Given that, you can guess that AMD found another way to annoy me with the rip-relative addressing they added in x86-64.
> Doesn't x86-64 have two types of addressing?...
Two types? Rip-relative is additional to existing addressing allowing data areas to be addressed relative to code - yuck. :-(
> > One positive thing I found in their design is the swapgs instruction.
> Hmm, I entered "swapgs" into Yahoo to find out what it does. The only things that come up are security vulnerabilities...
Well, any program has to be written properly.
..
> > Of course, setting up stack frames is not mandated by C is it?
> No. But, C requires recursion. I don't know all the info on this, but apparently computer scientists (CS) proved in the 1950s or 1960s that a stack using stackframes was the easiest way to implement recursion.
Thanks for the reminder. I mustn't lose sight of this. However, many of the functions of an OS do not have to be recursive. For example, the address space manager, the scheduler, interrupt service routines, device drivers etc. These don't need stack frames when faster mechanisms are available. ISTM that OS design allows us not so much to break the rules but to define the rules. Normal constraints don't always apply. We can design our own call interface or interfaces if there's good reason to do so, such as for performance.
> > When you get back to OS dev you may want to look at a more conventional boot method
> I've got two projects I'm more interested in at the moment.
I know.
> > Then you have full control of a virgin machine with no cruft from someone else's operating system.
> There's not much cruft from starting from DOS.
OK.
...
> > BTW, although page faults are called faults it doesn't imply faultiness or lack of reliability.
> Well, I recall seeing a chart in one of the manuals and didn't like some of the situations which generated them.
Sure, like GPFs. They may not be nice in that they have many potential causes but we can control what we allow to trigger them. Page faults are easier to manage than GPFs - or the dreaded double faults.
>James Harris wrote...
>>Marven Lee wrote...
>> James Harris wrote...
>> > 3) If the privileged stack must always be large enough how much space should be set aside? If it is only used to service interrupts and syscalls it probably doesn't need to be very big. A 4k page seems much too large. The bulk of the state can be saved in a thread image if desirable.
>> I think I use 4k (or maybe 8k) stacks per task in the kernel, these stacks are used during syscalls and for kernel tasks. I also have a single, separate stack for handling hardware interrupts.
[...]
>Just out of curiosity why not use just one PL=0 stack for when in privileged mode? Something to do with task switching?
I believe QNX is one of the few operating systems to use a single kernel stack per processor. This is called an interrupt-model kernel. Most other operating systems follow the process-model kernel with a kernel stack for each task.
With a single kernel stack it is not possible for more than one task to enter the kernel, so multitasking -within- the kernel is not easily possible. It is not possible to block or sleep within the kernel (and remain in the kernel). Well I suppose you could block on a spinlock waiting for a task on another processor.
Years ago on my first OS project I tried to write a microkernel that used an IPC mechanism similar to QNX's MsgSend, MsgReceive, MsgReply and MsgRead/MsgWrite functions. I ended up using an interrupt-model kernel with a single kernel stack. I got the IPC working, but the rest of my code became a mess, so I started on something different.
So in QNX (and in my OS) only basic synchronization primitives, timer handling and message copying were implemented in the kernel.
Synchronization primitives could be done simply, without interrupts being enabled for the duration of the syscall. Register state would be saved in the process structure instead of on the stack upon entry to the kernel. The main part of the syscall would then be done without blocking, though the current process and other processes could still be added to or removed from the run queue.
A syscall would typically look like this...
SyscallWrapper:
    Save registers onto current_process
    DoSyscall()          / bulk of syscall
                         / may call SchedReady(proc) or
                         / SchedUnready(proc) to add or remove
                         / processes from run queue. No blocking
                         / occurs during execution of DoSyscall.
    current_process = PickProc()
    Load registers from current_process
    Return from interrupt
It would be possible to enable interrupts during the execution of DoSyscall(), but the code that adds and removes processes from the run queue needs some form of locking.
For functions that are short in duration, interrupts could be disabled for their entire execution without causing missed interrupts.
For longer functions, such as copying messages during message passing, interrupts would be enabled. An interrupt may wake up another process, or a quantum may expire, either of which could require a reschedule of the current process. So the message copying code needs to check every so often that it is still the highest-priority runnable task, else return from interrupt to the new task. This requires some modification of the above syscall wrapper code to handle these cases and also to resume message passing when the task doing the message passing is run again.
I can't remember much more about my old code other than it got a bit messy. My current OS uses the kernel stack per process plus the separate interrupt stack. I haven't worked on it for some time though for lots of reasons. I seem to drift in and out of osdev.
> > > Given that, you can guess that AMD found another way to annoy me with the rip-relative addressing they added in x86-64.
> > Doesn't x86-64 have two types of addressing?...
> Two types? Rip-relative is additional to existing addressing allowing data areas to be addressed relative to code - yuck. :-(
That should allow easy implementation of position independent code, yes? Does that enhance the usefulness of paging?
> However, many of the functions of an OS do not have to be recursive.
Recursive functions were "nifty" when first learning how to program... Otherwise, I've found they are a) substitutes for loops b) a way of avoiding structured programming and c) use the stack and/or stackframes for uncontrolled memory allocation. So, I tend to avoid recursion.
> For example, the address space manager, the scheduler, interrupt service routines, device drivers etc. These don't need stack frames when faster mechanisms are available.
If you're using a C compiler, you might - depending on which compiler - be able to eliminate the stack frame by declaring the function as "void function(void)", or __declspec(naked), or perhaps with an __attribute__ for GCC - although *I* haven't determined which attribute... Let me know if you find for GCC.
> ISTM that OS design allows us not so much to break the rules but to define the rules.
What rules? :) My code. My rul... Wait. What rules?
On 2 Nov, 09:34, "Rod Pemberton" <do_not_h...@nohavenot.cmm> wrote:
...
> > > > Given that, you can guess that AMD found another way to annoy me with the rip-relative addressing they added in x86-64.
> > > Doesn't x86-64 have two types of addressing?...
> > Two types? Rip-relative is additional to existing addressing allowing data areas to be addressed relative to code - yuck. :-(
> That should allow easy implementation of position independent code, yes?
Does it? Existing jump instructions - both conditional and unconditional - jump relative to the program counter, so code-to-code references can already be position independent.
AIUI RIP-relative addressing allows *data* addressing relative to the instruction pointer. Is that a step forward...? I can't see it. For example, for performance reasons it makes sense to allow multiple instances of a program to use the same code. Now, how do we give each instance its own separate data address? RIP-relative addressing encourages sticking them all together in the old [code, data, bss, heap, stack] sequence.
> Does that enhance the usefulness of paging?
I don't know what you mean. What do you have in mind?
...
> > For example, the address space manager, the scheduler, interrupt service routines, device drivers etc. These don't need stack frames when faster mechanisms are available.
> If you're using a C compiler, you might - depending on which compiler - be able to eliminate the stack frame by declaring the function as "void function(void)", or __declspec(naked), or perhaps with an __attribute__ for GCC - although *I* haven't determined which attribute... Let me know if you find for GCC.
My plans for memory layout are generally formulating round assembler rather than C. This is deliberate. I want to define the best execution model possible. (For "best" read: most efficient, simplest, most flexible.) I don't want to be influenced by the cruft imposed by a compiler.
That said, I will look at supporting other models later. I just don't want to mandate their use. If I build the models around one or more preexisting compilers, then I will most likely end up with the models the compilers use.
> > ISTM that OS design allows us not so much to break the rules but to define the rules.
> What rules? :) My code. My rul... Wait. What rules?
The rules as to how programs operate, how the OS switches between them, how the OS responds to different protection requirements. That sort of thing.
> AIUI RIP-relative addressing allows *data* addressing relative to the instruction pointer.
If one uses mixed code and data, then if I move the position of the block of code and data, then using RIP addressing for data, nothing changes for offsets to either code or data... if I understood what you're saying since I have yet to read up on RIP. But, isn't the ability to relocate code and data what is called "position independent code"? You can move the code and data anywhere, and no offsets need to be recomputed?
According to the NASM documentation, NASM defines five special symbols to generate PIC for ELF in assembly. While I've not used them, I'd guess RIP-relative addressing would eliminate most of them, if not all. The key piece seems to be the "wrt" keyword, i.e., "with respect to", which as I trivially understand it, allows one to compute data offsets relative to other fixed locations. These offsets seem to be computed relative to a "got", i.e., "global offset table".
> Is that a step forward...? I can't see it.
Not sure. It seems to add support for mixed code and data or position independent code, which can improve locality of data in a cache... more speed? My recollections of 6502 assembly - aging ones, admittedly - are that x86 doesn't implement some of the powerful relative addressing modes it had.
> For example, for performance reasons it makes sense to allow multiple instances of a program to use the same code.
Ok.
> Now, how do we give each instance its own separate data address? RIP-relative addressing encourages sticking them all together in the old [code, data, bss, heap, stack] sequence.
Well, don't use RIP then... (?) It's not forced is it?
> > Does that enhance the usefulness of paging?
> I don't know what you mean. What do you have in mind?
Well, if RIP-relative addressing for data allows mixed code and data to be more position independent, then doesn't that mean that paging becomes more effective? or easier? You can just put the block of code and data wherever as long as it's page sized and page aligned. I.e., no need to recompute data offsets, etc... (?)
> My plans for memory layout are generally formulating round assembler rather than C.
If there is a conflict, how do you plan to implement C? Or, some other HLL?
> This is deliberate. I want to define the best execution model possible. (For "best" read: most efficient, simplest, most flexible.) I don't want to be influenced by the cruft imposed by a compiler.
Ah. Well, I'm not sure there is that much "cruft" with modern C compilers. About the only truly hidden thing the x86 C compilers I've used do to the emitted assembly - other than optimization - is construct stackframes around procedures. If you're using a stack in assembly, you might be creating stackframes already.
> > AIUI RIP-relative addressing allows *data* addressing relative to the instruction pointer.
> If one uses mixed code and data, then if I move the position of the block of code and data, then using RIP addressing for data, nothing changes for offsets to either code or data... if I understood what you're saying since I have yet to read up on RIP. But, isn't the ability to relocate code and data what is called "position independent code"? You can move the code and data anywhere, and no offsets need to be recomputed?
You introduced the term into the discussion so I guess it's your call as to what it means here. Perhaps in general it could be taken to mean code that does not need to have fixups applied when it is loaded. If the code can be loaded without fixups it's position-independent. If the entire file can be loaded without relocation perhaps that needs a different term - like position-independent image - but that's a term I've just made up.
The traditional single-image layout where the text, data and bss sections are at the bottom and the stack is at the top with heap in between doesn't seem to allow much relocation. Instead it seems to me to mandate running each image in its own address space. This makes for slower switching between processes.
> According to the NASM documentation, NASM defines five special symbols to generate PIC for ELF in assembly. While I've not used them, I'd guess RIP-relative addressing would eliminate most of them, if not all. The key piece seems to be the "wrt" keyword, i.e., "with respect to", which as I trivially understand it, allows one to compute data offsets relative to other fixed locations. These offsets seem to be computed relative to a "got", i.e., "global offset table".
I took a look at the nasm doc on this but it looks complicated and I haven't studied it.
> > Is that a step forward...? I can't see it.
> Not sure. It seems to add support for mixed code and data or position independent code, which can improve locality of data in a cache... more speed? My recollections of 6502 assembly - aging ones, admittedly - are that x86 doesn't implement some of the powerful relative addressing modes it had.
Interesting. I too wrote some early machine code for the 6502. IIRC it had addressing relative to X and Y registers and needed zero-page space. (Ring any bells?) It was great at the time but only *if* there were zero-page spaces available. If these were in short supply the indexing was very constrained. Even if there were spaces available it was primitive. You sure you aren't looking through a romantic haze of nostalgia? :-)
> > For example, for performance reasons it makes sense to allow multiple instances of a program to use the same code.
> Ok.
> > Now, how do we give each instance its own separate data address? RIP-relative addressing encourages sticking them all together in the old [code, data, bss, heap, stack] sequence.
> Well, don't use RIP then... (?) It's not forced is it?
No. I have been careful to say its addition *encourages* lumping data along with code. It doesn't mandate it. That said, in protected mode we had an effective DS register that is no longer there. Instead, for OS code at least we are encouraged to use GS to locate local data. The abolition of one segment register and the promotion of another seems inconsistent.
One option for running multiple instances of a piece of code is to reserve a page for offsets or pointers and then change just that page when switching between tasks. The page would be at a fixed location. It's a slight workaround but as long as invlpg invalidates just one page (which it should do but it's not guaranteed) it ought to be fast. It's perhaps the best option I have (for lightweight task switching between homogeneous tasks). The switched page would always be at the same location but each version would point to its local locations in the address space.
Another option is to use GS to point to each instance's local data.
Yet another one is to use different models in 32-bit and 64-bit modes. The massive address space in 64-bit mode allows a different approach to achieve similar results in memory management.
> > > Does that enhance the usefulness of paging?
> > I don't know what you mean. What do you have in mind?
> Well, if RIP-relative addressing for data allows mixed code and data to be more position independent, then doesn't that mean that paging becomes more effective? or easier? You can just put the block of code and data wherever as long as it's page sized and page aligned. I.e., no need to recompute data offsets, etc... (?)
I don't know. You can always align code and data in 32-bit mode. I can see the advantage of loading a position-independent code image - as long as only one instance of it is needed. If more instances are needed it looks like they need their own address spaces - which are slower to switch between. Of course, people tend to use threads with their associated pros and cons.
> > My plans for memory layout are generally formulating round assembler rather than C.
> If there is a conflict, how do you plan to implement C? Or, some other HLL?
Multiple models can be supported, I think. Having a simpler lightweight model doesn't preclude supporting a more traditional model in a different address space. At least that's the plan. In theory the compiler could write an object file appropriate to any supported model.
The point is to prevent the OS design being controlled by existing compilers - to prevent the tail from wagging the dog if you like.
> > This is deliberate. I want to define the best execution model possible. (For "best" read: most efficient, simplest, most flexible.) I don't want to be influenced by the cruft imposed by a compiler.
> Ah. Well, I'm not sure there is that much "cruft" with modern C compilers. About the only truly hidden thing the x86 C compilers I've used do to the emitted assembly - other than optimization - is construct stackframes around procedures. If you're using a stack in assembly, you might be creating stackframes already.
That's just it, you see - I'm not forcing the use of stack frames for every call. One model I'm playing with is modules which have persistent activation records. These cannot sit on the stack. Functions which are not recursive don't need a stack frame either.
A C compiler would come with this cruft. (I stand by the term.) And even though C is low level by design it still doesn't allow the call scheme to be controlled. The average compiler implements calls and parameter passing the way it implements them and that's it.
> The traditional single-image layout where the text, data and bss sections are at the bottom and the stack is at the top with heap in between doesn't seem to allow much relocation. Instead it seems to me to mandate running each image in its own address space. This makes for slower switching between processes.
Oooo... I must've misunderstood something.
I basically consider a "single-image layout" to be the binary data that comprises the executable application, i.e., a file. The text (code) and data comprise the "image". The bss is allocated after the "image" in memory by the executable loader/startup and cleared too. The stack and heap are wherever the OS places them, AIUI. They can be one of each or many of each or whatever..., AIUI.
> Interesting. I too wrote some early machine code for the 6502. IIRC it had addressing relative to X and Y registers and needed zero-page space. (Ring any bells?)
Yup, faint ones way off in the distance... 6510 actually. Unless one coded on the C64, people ask: "What's a 6510?"... It's a 6502 with a port.
Appendix L of the C64's Programmer's Reference manual is most of (or all of...) the MOS Tech datasheet for the 6510. I'm thinking of pages 416 and 417. It shows the 6510 instruction set and thirteen (13) addressing modes. Appendix L is available as "Chapter 7 - C64 Programmers Reference - Appendices" in almost 10MB .pdf form at the link:
> It was great at the time but only *if* there were zero-page spaces available.
IIRC, compared to other microprocessors at the time, it was what I'd call a memory-based and/or "load-store" design - few registers, fast zero page memory instructions, accumulator/memory based programming model, etc. IIRC, most of the magazines at the time called it a load-store microprocessor too... IIRC, others, like the Z80 - which I never programmed - used registers far more heavily. Unfortunately, the term "load-store" has been warped over the ages. Wikipedia's usage doesn't match my recollections.
> You sure you aren't looking through a romantic haze of nostalgia? :-)
Probably, but with 13 addressing modes, who knows... ;-)
> That said, in protected mode we had an effective DS register that is no longer there. Instead, for OS code at least we are encouraged to use GS to locate local data.
A good reason to at least consider nanokernels...
> The abolition of ... and the promotion of ... seems inconsistent.
That general problem: inconsistency, between 16-bit, 32-bit, and 64-bit x86, is another reason I'm considering interpreters *a lot* lately. I don't like their slowness, but if 128-bit comes out in a year or two, am I to rewrite everything again? If 64-bit AA isn't "whack" enough, what if they do something really, really radical for 256-bits, e.g., to conserve memory? Who wants 256-bit offsets and pointers? Code bloat? Justification for RIP-relative addressing? What happens if the ARM microprocessor eventually replaces the x86 in low-end laptop PCs? The continuous obsolescence and rejuvenation cycle of PCs and PC OSes is a real time and life "killer", IMO. I fully understand why *nix users don't want to give up *nix after decades of use. I didn't want to give up my C64 either...
> One option for running multiple instances of a piece of code is to reserve a page for offsets or pointers and then change just that page when switching between tasks. The page would be at a fixed location. It's a slight workaround but as long as invlpg invalidates just one page (which it should do but it's not guaranteed) it ought to be fast. It's perhaps the best option I have (for lightweight task switching between homogeneous tasks). The switched page would always be at the same location but each version would point to its local locations in the address space.
Yeah, I'm _not_ sold on _not_ keeping applications completely separated. I think we may have conversed on this previously...
> Yet another one is to use different models in 32-bit and 64-bit modes. The massive address space in 64-bit mode allows a different approach to achieve similar results in memory management.
Except for the negative speed issue, I'm slowly becoming more and more convinced interpreters are the solution. I can stop and start an interpreter's emulation at will without risk of losing control of processor execution, e.g., no control-flow hacks from buffer overruns etc. Just how do you crash an interpreter? They are easy to write and maintain. They can produce fairly compact "code" as byte or token sequences. They don't need porting or recompiling for different cpu modes. But, they do need a different interpreter for each cpu mode or environment.
> The point is to prevent the OS design being controlled by existing compilers - to prevent the tail from wagging the dog if you like.
Ah. Well, I agree "bare metal" is good. As a slight aside, Alexei proved to me that bare metal does hide some errors, and that I should eventually test on some emulators too.
> persistent activation records.
I'm not familiar with that term. A few quick searches pull up "Napier88" and "persistent programming language" and "orthogonal programming" and:
"Using C as a Compiler Target Language for Native Code Generation in Persistent Systems" by S.J. Bushell, A. Dearle, A.L. Brown, & F.A. Vaughan.
"Orthogonal persistence" from one of those seems to indicate a permanent or semi-permanent state, like an OS which doesn't start up or shut down but resumes where it was previously. FORTH, when used as an OS, is sort-of similar. All changes to the FORTH environment get preserved.
The other indicates something about data continuing to exist after an application has been closed. However, it doesn't really specify what a continuing existence means. Is the data "existing" in memory? Is the data "existing" on disk? If memory, how is it accessible outside the application? etc.
> Functions which are not recursive don't need a stack frame either.
? ? ? ? ?
Hmm, I'll have to think about that statement for a while...
As a quick note, I avoid recursion and currently "see" beneficial use of stack frames, e.g.,
- allowing automatic ("auto" in C) allocation and cleanup of local variables in procedures, functions, subroutines
- helping to implement call-by-value
- reducing memory space needed by code for variables with temporary scope
> cruft. (I stand by the term.)
Y'know, it's claims like that that lead to nicknames one hates... Right? e.g., James "Crufty" Harris... :-)
> even though C is low level by design it still doesn't allow the call scheme to be controlled.
True. I've chosen C, for the most part. It can't do everything needed by an OS: "some assembly is required".
> The average compiler implements calls and parameter passing the way it implements them and that's it.
How many calling conventions and/or parameter passing methods does one need? As I see it a system should only need one (or two like typical C...). Eventually everything on the system uses that one.
> > The traditional single-image layout where the text, data and bss sections are at the bottom and the stack is at the top with heap in between doesn't seem to allow much relocation. Instead it seems to me to mandate running each image in its own address space. This makes for slower switching between processes.
> Oooo... I must've misunderstood something.
> I basically consider a "single-image layout" to be the binary data that comprises the executable application, i.e., a file. The text (code) and data comprise the "image". The bss is allocated after the "image" in memory by the executable loader/startup and cleared too. The stack and heap are wherever the OS places them, AIUI. They can be one of each or many of each or whatever..., AIUI.
You are right. I was talking about a basic image of a program in execution rather than that from disk.
...
> on the C64, people ask: "What's a 6510?"... It's a 6502 with a port.
...
> IIRC, compared to other microprocessors at the time, it was what I'd call a memory based and/or "load-store" design - few registers, fast zero page memory instructions, accumulator/memory based programming model, etc. IIRC, most of the magazines at the time called it load-store microprocessor too... IIRC, others, like the Z80 - which I never programmed - used registers far more heavily.
The big advantage of the Z80 architecture was that it had 16-bit registers - and there were more of them. IIRC the humble 6502 had only one 8-bit accumulator and two 8-bit index registers. Sounds poor now but my recollection is that it was still great fun to use. Happy memories! I wrote a disassembler for it but, IIRC, we still wrote the programs in machine code. Of course its machine code was much easier to remember. There weren't the 2- and 3-bit fields and the variable length opcodes we have today.
> Unfortunately, the term "load-store" has been warped over the ages. Wikipedia's usage doesn't match with my recollections.
Would referring to the number of operands an instruction takes change less over time?
> Ah. Well, I agree "bare metal" is good. As a slight aside, Alexei proved to me that bare metal does hide some errors, and that I should eventually test on some emulators too.
Was that on a.o.d? I'd like to look into that some more.
> > persistent activation records.
> I'm not familiar with that term. A few quick search pulls up "Napier88" and "persistent programming language" and "orthogonal programming" and:
It wasn't meant as a term in itself. I was just talking about making an activation record persistent - i.e. to last between one activation and another. I think this can have both performance and security gains. (Less state to set up and static variables can stay in scope.)
...
> Y'know, it's claims like that lead to nicknames that one hates... Right? e.g., James "Crufty" Harris... :-)
That's a bit below the belt. We each have things we have stated we dislike. We don't want to get back to the law of the playground here, do we?
> > even though C is low level by design it still doesn't allow the call scheme to be controlled.
> True. I've chosen C, for the most part. It can't do everything needed by an OS: "some assembly is required".
> > The average compiler implements calls and parameter passing the way it implements them and that's it.
> How many calling conventions and/or parameter passing methods does one need? As I see it a system should only need one (or two like typical C...). Eventually everything on the system uses that one.
I see your point. My take on this is that where there are better or faster ways of doing things an OS designer should consider them. I have absolutely no problem with having multiple ways for apps to interact with the OS as long as there is clear value in each one. For example, some may be simpler, others may be faster. In fact, perhaps the simpler ways can be wrappers around those which are faster. As you say, they all come back to a common base.
> I was just talking about making an activation record persistent - i.e. to last between one activation and another. I think this can have both performance and security gains. (Less state to set up and static variables can stay in scope.)
One of the things I've become very fond of with DOS (and Windows 98) is the ability to start the OS in and from a clean state.
It's one of the things I dislike about newer versions of Windows, which seem very paternalistic to me. If newer Windows journals something to the filesystem when I'm shutting down, I can't get rid of it - "the problem" - via a reboot like older OSes. The journaling increases the persistence of the problem. If the OS is experiencing some mystery glitch - which seems to happen with all versions of Linux and Windows - I can reboot some OSes (like DOS) and the problem goes away. With newer OSes, the problem seems to reappear due to persistent state, e.g., from journaling, or hibernation and sleep modes.
> > Y'know, it's claims like that lead to nicknames that one hates... Right? e.g., James "Crufty" Harris... :-)
> That's a bit below the belt. We each have things we have stated we dislike. We don't want to get back to the law of the playground here, do we?
Sorry! That wasn't meant in a hostile way, but a humourous one. I was just pointing out your persistent activation record, er.. use of "cruft". ;-)
> > That said, in protected mode we had an effective DS register that is no longer there. Instead, for OS code at least we are encouraged to use GS to locate local data.
> A good reason to at least consider nanokernels...
The first part of the quote on QNX from Wikipedia interests me, but the last sentence might interest you. It seems to be one method for implementing a "persistent activation record".
"The QNX kernel contains only CPU scheduling, interprocess communication, interrupt redirection and timers. Everything else runs as a user process, including a special process known as proc which performs process creation, and memory management by operating in conjunction with the microkernel. This is made possible by two key mechanisms - subroutine-call type interprocess communication, and a boot loader which can load an image containing not only the kernel but any desired collection of user programs and shared libraries."
**** Someone a while back (James? Ben?) asked what was needed in an OS. IIRC, I answered in terms of hardware and hardware interfaces, such as USB, harddisk, etc. However, from Wikipedia, it's easy to summarize what is needed for a microkernel:
A minimal microkernel:
- address space mechanisms
- CPU scheduling
- IPC
> > > That said, in protected mode we had an effective DS register that is no longer there. Instead, for OS code at least we are encouraged to use GS to locate local data.
> > A good reason to at least consider nanokernels...
> The first part of quote on QNX from Wikipedia interests me, but the last sentence might interest you. It seems to be one method for implementing a "persistent activation record".
> "The QNX kernel contains only CPU scheduling, interprocess communication, interrupt redirection and timers. Everything else runs as a user process, including a special process known as proc which performs process creation, and memory management by operating in conjunction with the microkernel. This is made possible by two key mechanisms - subroutine-call type interprocess communication, and a boot loader which can load an image containing not only the kernel but any desired collection of user programs and shared libraries."
> **** Someone a while back (James? Ben?) asked what was needed in an OS. IIRC, I answered in terms of hardware and hardware interfaces, such as USB, harddisk, etc. However, from Wikipedia, it's easy to summarize what is needed for a microkernel:
> A minimal microkernel
> - address spaces mechanisms
> - cpu scheduling
> - IPC
> > > As a slight aside, Alexei proved to me that bare metal does hide some errors, and that I should eventually test on some emulators too.
> > Was that on a.o.d? I'd like to look in to that some more.
> IIRC, there were a couple.
> The most specific one I've been able to relocate was in a thread titled "new FYSOS release". It was a thread discussing Ben Lunt's FYSOS. (The first message allows pulling up the entire thread, if wanted.): http://groups.google.com/group/alt.os.development/msg/b332b6f2b287ed8...
> I normally list the key points of a post in the subject heading but in this case there are just too many.... The post is about detecting application stack overflow and underflow and, in particular, protecting and sizing the privileged stack in 32-bit and 64-bit modes.
> I'd appreciate your thoughts, suggestions and corrections.
> I'm looking at the base Intel and AMD 64-bit architecture (which I'll call x86-64 herein) with a view to it influencing my 32-bit code. Why? Well, it seems sensible to design 32-bit operations which don't require too many changes to port to 64-bit later. I've not looked at 64-bit working before. It is quite different, isn't it!
the main thing for 32/64 bit compatibility (at the C level) is writing fairly generic code...
> 1) In x86-64 the stack segment has base = 0 and limit = none as do code and data segments. So it's not even an option to detect stack overflow (a request for stack expansion) or underflow (trying to remove more than the stack holds) by reference to the stack segment. The only option I can think of is to have guard page frames above and below every application (non-privileged) stack. These would be marked not-present. Is this the best way to detect application stack overflow and underflow?
yes, the flat model is the only real option for x86-64...
as for not-present pages around the stack, this is a common practice at least... another practice is to not bother and assume the stack is big enough, but this is more of a lazy option...
> 2) The privileged stack is a critical resource, isn't it? AFAICS it must always have present memory to write to. If, in a page fault, there is not enough stack space we'll get a double fault. And because double faults are not restartable there is no apparent means of recovery. So how is it best to provide privileged stack space? Should its size be checked at the top or bottom of some or all service routines, or can all service routines be written to unwind it before returning to user mode? It seems so but it would be good to hear what you guys have done or are thinking of.
my comment: treat the stack as one normally treats the stack, as in, always fully unwind before returning.
granted, there are several different ways to handle the Ring3->Ring0 transition. I forget the details (my OS dev work was a long time ago), but I remember that the initial transition to ring-0 took place while using the ring-3 stack, which was used to save off the ring-3 state, and at this point a bit of magic was used to swap the stack (and maybe fully transition to ring-0), and transfer control back into the C code in the kernel.
from what I remember, this mechanism was also used in thread and process switching. in this way, the state of a thread was always saved at the bottom of the stack.
> 3) If the privileged stack must always be large enough how much space should be set aside? If it is only used to service interrupts and syscalls it probably doesn't need to be very big. A 4k page seems much too large. The bulk of the state can be saved in a thread image if desirable.
errm, you will usually end up doing stuff in the kernel (as in, calling through piles of C code, ...), so one will need a little more stack than this...
in my OS, from what I remember, I used 32kB for the ring-0 stack. actually, from what I remember, I set up the stack in the boot loader to point just below the boot loader, and then proceeded to load the second-stage loader to 0x8000, which used the same stack. then I loaded the kernel (I think between 0x10000 and 0x9FFFF or similar...), and continued using the same stack after kernel startup.
(this may have changed later on, as I think I am remembering something about having used RLEW compression on the kernel, but am not sure...).
memory above 1M was then generally used for kernel heap and for application code/data.
so, the stack was 32kB mostly because the stack top was at 0x7C00 or similar.
similarly, the second stage remained in memory after kernel startup, itself mostly serving as a place for holding the GDT and IDT, and also any realmode components of the kernel (where some drivers, such as for VESA, had worked by essentially jumping back and forth between realmode and protected mode, ...).
from what I remember, the kernel also generally operated with a (mostly) plain raw address space view of the world (except I also remember using some trick where the page table included itself as an index, such that the whole page table looked like a flat 4MB buffer regardless of which pages were used to build its structure).
but, alas, it is difficult to remember the specifics after what is around 7-8 years now...