Re: memset help

#989

On 4/14/20 11:32 PM, Peter Coghlan wrote:
I wonder could this have been the COBOL compiler abusing MVCL instructions
in situations where they were not the appropriate instructions to use?

Perhaps instructions such as MVCL would be expected to be "hot spots" because
they can deliver a relatively large amount of work for a single instruction?
Or is it that implementations of this instruction were sometimes poorer than
they ought to be and they were really not delivering bang for buck?

I was told back in the 1980s that, for performance reasons, MUSIC moved 4096 bytes of data via a series of MVC instructions in place of one MVCL.

-- 
Drew Derbyshire

"All right, Mr. DeMille, I'm ready for my close-up." -- "Sunset Blvd.,"

Re: memset help

#988

On Wed, Apr 15, 2020 at 08:02 AM, adriansutherland67 wrote:

it takes me about 5 mins to write a single line of S/370 assembler not counting debugging!

I do indeed loathe assembler - especially as I use the "infinite number of monkeys" method - however I have managed to detect the stack running out of space. Big day!

A

Re: memset help

#987

On Wed, Apr 15, 2020 at 01:42 PM, pjfarley3 wrote:

Apologies for the mistake.

No drama!

Re: memset help

#986

Adrian,

I made a coding error in all three of the MEMSET texts that I sent: in all of them I have this assembler instruction to declare register 15 as the code base register:

         USING 15,*          USE R15 AS BASE REG

Wrong order of the operands. Please change this to the following in all three versions:

         USING *,15          USE R15 AS BASE REG

The syntax is “USING BASEADDR,BASEREG[,BASEREG]*”

Apologies for the mistake.

Peter


Re: memset help

#985

-----Original Message-----
From: [email protected] <[email protected]> On Behalf Of Harold
Grovesteen
Sent: 15 April 2020 14:15
To: [email protected]
Subject: Re: [h390-vm] memset help

On Wed, 2020-04-15 at 13:24 +0100, Steven Fosdick wrote:

I did wonder about the possibility of setting up gcc as a
cross-compiler but that doesn't seem trivial to do.

Steve.

Yes.  I did accomplish this and it is documented with scripts in the SATK.  It
used GNU as as the assembler for stand alone, aka bare metal, coding on
Hercules.  After literally years of work on that, it just did not work well
enough and I changed to a new toolset that is part of the project.

However, the key difference between GCC on VM and other operating
systems supported by Hercules and GCC as used on Linux is the output
format.  GCC typically generates ELF object module files.  The GCC on VM
generates mainframe object modules.  Huge difference and a fundamental
reason this GCC is used with the operating systems that run on Hercules.

With my pedant's hat on, it actually generates "normal" Assembler that is assembled by the XF assembler on VM or MVS.
That is why we get "normal" object files which can be loaded with the VM loader or the OS Linkage Editor.
I have tried feeding the assembler into Assembler G with poor results.
I haven't tried Assembler H...
... it's pretty easy to produce a GCC that compiles 370 code on Windows/Linux. After all, that’s how I built the first GCCCMS.
Getting all the assembler to CMS to compile it was the fun part.

Harold Grovesteen

Dave

Re: memset help

#984

On Wed, 2020-04-15 at 13:24 +0100, Steven Fosdick wrote:
I did wonder about the possibility of setting up gcc as a
cross-compiler but that doesn't seem trivial to do.

Steve.

Yes.  I did accomplish this and it is documented with scripts in the
SATK.  It used GNU as as the assembler for stand alone, aka bare metal,
coding on Hercules.  After literally years of work on that, it just did
not work well enough and I changed to a new toolset that is part of the
project.

However, the key difference between GCC on VM and other operating
systems supported by Hercules and GCC as used on Linux is the output
format.  GCC typically generates ELF object module files.  The GCC on
VM generates mainframe object modules.  Huge difference and a
fundamental reason this GCC is used with the operating systems that run
on Hercules.

Harold Grovesteen

Re: memset help

#983

On Wed, 2020-04-15 at 00:02 -0700, adriansutherland67 wrote:
One thing, modern compilers generally produce faster code than a
human assembler programmer. This is because of both front end
optimisations (e.g. reassigning / calculating values that have not
changed), and backend optimisations based on loop unrolling, and
instruction reordering based on CPU pipelines etc.

I remind everyone working with GCC on VM/370 that it is a port of the
old i370 architecture version of GCC.  The modern optimizations are
likely to be quite limited.  This GCC is not the same version of GCC
that is used today.

The GCC group had decided to remove i370 from the product because of
its lack of use or development.  It was rescued and modified to run on
the various mainframe operating systems.

Whether this is a consideration or not, there are other versions of
this GCC that are essentially the same.  The GCC that runs on other
operating systems is from the same source code.  I do not know enough
of the inner workings to know where the operating system dependent code
exists (which is different for each) and what is common.

If it is important that this altered version of GCC on VM/370 is source
compatible, care needs to be taken as to what is altered.

Just a heads up, but I think this should be understood.

Harold Grovesteen

Re: memset help

#982

Hi Steven,

no, not that I know of. But it is good to remember, because the assumption that it will end within a scheduler slice can be false.

That is in z/OS, where, a long time ago, I came across errors based on that assumption, some of them my own. I don’t know about VM.

René.

On 15 Apr 2020, at 14:24, Steven Fosdick <stevenfosdick@...> wrote:

On Tue, 14 Apr 2020 at 23:14, rvjansen@... <rvjansen@...> wrote:

The other thing to remember about mvcl is that it is interruptible.

Is that part of the performance issue?  I mean indirectly, of course.

Re: memset help

#981

On Tue, 14 Apr 2020 at 23:14, rvjansen@... <rvjansen@...> wrote:

The other thing to remember about mvcl is that it is interruptible.

Is that part of the performance issue?  I mean indirectly, of course.
I don't really know the 370 architecture but I have come across a
similar move instruction, LDIR on the Z80 that is rather slow.  That's
a relative term because it's still faster than writing the loop
yourself.

In the case of LDIR there is a non-repeating version (LDI) which loads
the value whose address is in register HL, stores it to the address in
register DE, increments HL and DE and decrements BC.  The repeating
version works by adding a final step of testing BC and, if that is not
zero, it forces the program counter back to the address of the LDIR
instruction.  That means the LDIR instruction is now re-fetched,
re-decoded and re-executed and the process repeats for each byte moved
until BC becomes zero.

It is interruptible and the interrupt is serviced just before the LDIR
instruction is re-fetched so it would push the values for BC, DE, HL
from halfway through the move, service the interrupt, then pop those
values and carry on where it left off.
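
The behaviour described above can be modelled in a few lines of C. This is only a sketch of the semantics as described in this message, not an emulator: the struct and function names are made up for illustration, and mem is assumed to be a full 64K address space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the Z80 register pairs used by LDI/LDIR. */
struct z80_regs {
    uint16_t hl;  /* source address      */
    uint16_t de;  /* destination address */
    uint16_t bc;  /* bytes still to move */
};

/* One LDI step: copy a byte, increment HL and DE, decrement BC. */
static void ldi(struct z80_regs *r, uint8_t *mem)
{
    mem[r->de] = mem[r->hl];
    r->hl++;
    r->de++;
    r->bc--;
}

/*
 * LDIR: repeat the LDI step until BC reaches zero.  Each pass of the
 * loop stands for one fetch/decode/execute of the instruction, so the
 * return value is the number of times it was (re-)fetched.
 */
static unsigned long ldir(struct z80_regs *r, uint8_t *mem)
{
    unsigned long fetches = 0;

    assert(r != NULL && mem != NULL);
    do {
        ldi(r, mem);
        fetches++;
        /* An interrupt could be taken here: HL, DE and BC already
           hold everything needed to resume the move afterwards. */
    } while (r->bc != 0);
    return fetches;
}
```

Moving three bytes costs three passes through the loop, which is the per-byte re-fetch overhead that makes LDIR slow relative to what a single instruction might suggest.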

Back to memset on 370, it's great to have an efficient implementation
in the library but having the compiler inline it would make it even
faster.  Apart from removing the call overhead the compiler may know
the length already, i.e. it may be a constant expression, and can thus
avoid tests and having two loops, one for 256 bytes and one for the
remainder.  It should also know if the data are aligned.  I know gcc
can and does do this, at least on X86 - here's an example, first the
C:

#include <string.h>

extern void do_something(char *x);

int main(int argc, char *argv[])
{
    char x[45];
    memset(x, 0, sizeof(x));
    do_something(x);
    return 0;
}

I have deliberately declared an external function to receive the
string so the compiler does not detect dead code and remove it
altogether.  Here's the result of compiling to assembler:

        .file   "memstst.c"
        .text
        .section        .text.startup,"ax",@progbits
        .p2align 4
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        subq    $72, %rsp
        .cfi_def_cfa_offset 80
        pxor    %xmm0, %xmm0
        movq    %fs:40, %rax
        movq    %rax, 56(%rsp)
        xorl    %eax, %eax
        movq    %rsp, %rdi
        movl    $0, 40(%rsp)
        movq    $0, 32(%rsp)
        movb    $0, 44(%rsp)
        movaps  %xmm0, (%rsp)
        movaps  %xmm0, 16(%rsp)
        call    do_something@PLT
        movq    56(%rsp), %rax
        xorq    %fs:40, %rax
        jne     .L5
        xorl    %eax, %eax
        addq    $72, %rsp
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        ret
.L5:
        .cfi_restore_state
        call    __stack_chk_fail@PLT
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Arch Linux 9.3.0-1) 9.3.0"
        .section        .note.GNU-stack,"",@progbits

Note the absence of a call to memset.  So the core of this is using a
zeroed 16 byte wide register:

        movaps  %xmm0, (%rsp)
        movaps  %xmm0, 16(%rsp)

for the lower part of the space and a selection of other lengths for
the remainder:

        movl    $0, 40(%rsp)
        movq    $0, 32(%rsp)
        movb    $0, 44(%rsp)
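
For anyone who does not read x86, the same set of stores can be mirrored in C. This is an illustrative sketch only: clear45 is a hypothetical name, the widths (16 + 16 + 8 + 4 + 1 = 45 bytes) are taken from the listing, and each fixed-size memcpy stands in for one store instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Clear exactly 45 bytes with the same store widths gcc chose. */
static void clear45(unsigned char *x)
{
    static const unsigned char zero16[16] = {0};
    const uint64_t zero8 = 0;
    const uint32_t zero4 = 0;

    assert(x != NULL);
    memcpy(x,      zero16, 16);  /* movaps %xmm0, (%rsp)   */
    memcpy(x + 16, zero16, 16);  /* movaps %xmm0, 16(%rsp) */
    memcpy(x + 32, &zero8, 8);   /* movq   $0, 32(%rsp)    */
    memcpy(x + 40, &zero4, 4);   /* movl   $0, 40(%rsp)    */
    x[44] = 0;                   /* movb   $0, 44(%rsp)    */
}
```

There is no loop and no length test, which is only possible because the length is a compile-time constant.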

I did wonder about the possibility of setting up gcc as a
cross-compiler but that doesn't seem trivial to do.

Steve.

Re: memset help

#980

Translate - like the rexx function but then in assembler.

Anecdote: I once got a performance problem on my desk. It did a character translation, in C. It looped through a string, and replaced character by character.

I noticed it had one flaw: it did not stop when it found the right character, but went on to 255 - for every character. (true story!)

Next day I replaced it by an assembler version with two tables and a translate instruction. The application flew.

(this was on a 486, with XLATB, but it is the same thing. These things are hard to figure out for a compiler).
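
A minimal C sketch of the table-driven approach, for comparison with the per-character search described above. The names build_table and translate are hypothetical, and a single 256-entry table is assumed here rather than the two tables of the original fix.

```c
#include <assert.h>
#include <stddef.h>

/* Build a 256-entry table: identity by default, then from[i] -> to[i]. */
static void build_table(unsigned char table[256],
                        const unsigned char *from,
                        const unsigned char *to, size_t n)
{
    assert(table != NULL);
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)i;
    for (size_t i = 0; i < n; i++)
        table[from[i]] = to[i];
}

/* Translate in place: one table lookup per byte, no searching. */
static void translate(unsigned char *s, size_t len,
                      const unsigned char table[256])
{
    for (size_t i = 0; i < len; i++)
        s[i] = table[s[i]];
}
```

The cost per character is a single indexed load, which is what TR (or a loop around XLATB) does, instead of up to 255 comparisons.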

(this is also why we need drivers for e.g. SSL to use cryptographic assist/acceleration instructions - the compiler won’t do that for you - that would be a bit like clippy telling you “ah, I see you are using a bubblesort, let me replace that by a blockset search”).

René.

On 15 Apr 2020, at 10:02, adriansutherland67 <adrian@...> wrote:

On Wed, Apr 15, 2020 at 07:11 AM, rvjansen@... wrote:

TR and TRT
?

Anyway I am biased ... as it takes me about 5 mins to write a single line of S/370 assembler not counting debugging!

Re: memset help

#979

On Wed, Apr 15, 2020 at 07:15 AM, Dave Wade wrote:

I doubt any change on Hercules will yield such a performance increase

Now that sounds like a challenge!

Let's see what we get with GCCLIB IO improvements. I will make sure my test lines are at least 100 characters long :-)

Re: memset help

#978

On Wed, Apr 15, 2020 at 07:11 AM, rvjansen@... wrote:

TR and TRT

?

Anyway I am biased ... as it takes me about 5 mins to write a single line of S/370 assembler not counting debugging!

Re: memset help

#977

Peter,

When I first started work, I worked in a small UK insurance company. We had a Honeywell H3200, which basically ran IBM 1401 code, but with “Improved io”…

We had one small program that was very slow. It was Cobol. We replaced lots of conditional performs with “ALTER” statements. It ran about 100 times faster but was totally understandable.

I doubt any change on Hercules will yield such a performance increase. As Michael Jackson said

“I am on a world tour. My tour is in pursuit of exceptional beer. That's why they call me the Beer Hunter.”

That’s a pursuit I try and emulate to the best of my ability….

… that was by the way Michael the Beer Hunter

.. another Michael Jackson said, in “Principles of Program Design”

there are two rules for optimization…

Don’t do it
Don’t do it yet.

I think we are still at Rule 1. You can’t optimize something that doesn’t work…..

… for you to run “Strobe” it had to work…

Dave

p.s. I believe that some folks think Michael Jackson the singer and dancer was most famous. Possibly this is true, but I still prefer the works of the two above..

Re: memset help

#976

For a long time now, compilers have produced faster code, because human coders kept choosing storage-to-storage instructions where the register-based versions were faster.

But in cases like TR and TRT the assembler programmer always wins.

René.

Re: memset help

#975

All interesting ... and I will try out each candidate and report back. It will be tested only on Hercules so in one sense not a fair test. On the other hand we could argue that Hercules is S/370 done well ... meaning I agree with a comment that if a CPU manufacturer defines a bulk move command they should implement it fast!

One thing, modern compilers generally produce faster code than a human assembler programmer. This is because of both front end optimisations (e.g. reassigning / calculating values that have not changed), and backend optimisations based on loop unrolling, and instruction reordering based on CPU pipelines etc.

This is why LLVM is becoming the one toolchain to rule them all ... even IBM works with them to ensure mainframe CPU internals are fully leveraged.

A

Re: memset help

#974

The most effective solutions in those projects were finding the "CPU hot
spots" (the Strobe product was always particularly effective for such
efforts), and more times than not the worst "hot spots" turned out to be
MVCL and sometimes CLCL instructions in compiler-generated COBOL code, and
the second worst "hot spots" were the COBOL INITIALIZE verb for large
heterogenous structures used inside of a loop, or at every invocation of a
subroutine.

I wonder could this have been the COBOL compiler abusing MVCL instructions
in situations where they were not the appropriate instructions to use?

Perhaps instructions such as MVCL would be expected to be "hot spots" because
they can deliver a relatively large amount of work for a single instruction?
Or is it that implementations of this instruction were sometimes poorer than
they ought to be and they were really not delivering bang for buck?

Regards,
Peter Coghlan.

Re: memset help

#973

Dave,

I’m acutely aware of that IBM advice, but in the last two decades I have also been involved in multiple rounds of “MIPS-saving” projects when management wanted application teams to “do more with less” (i.e. increase performance / reduce batch execution times without buying more/bigger hardware).

The most effective solutions in those projects came from finding the “CPU hot spots” (the Strobe product was always particularly effective for such efforts), and more often than not the worst “hot spots” turned out to be MVCL and sometimes CLCL instructions in compiler-generated COBOL code, and the second worst “hot spots” were the COBOL INITIALIZE verb for large heterogenous structures used inside of a loop, or at every invocation of a subroutine.

Finding ways to hoist long-length moves/compares and INITIALIZE verbs out of business processing loops usually yielded the best/largest reductions of CPU and elapsed time used. Second-best solutions were complicated and usually application-specific adjustments to business processing rules (along the lines of “the fastest I/O or business process is the one not done”).

But I digress from the subject at hand. You and Harold are right here: for Hercules, fewer instructions will yield better performance, so if replacement of the C version of MEMSET would dramatically improve performance for C programs then the MVCL solution will probably work best under Hercules.

Peter

Re: memset help

#972

Peter,

I tend not to worry about performance, but anyway I believe that IBM's current advice is don’t try, and instruction timings disappeared from the manuals yonks ago…

For example early 9370s were especially bad on non-aligned instructions, but modern hardware doesn’t care a jot. I suppose it might be nice to try it on the P390, but I am still trying to re-write a web site …

Dave

Re: memset help

#971

Because I did not remember that MVCL was available at the 370 architecture level (and failed to go look it up) and because MVCL has mostly been quite slow at the real-iron hardware level.

Of course, Hercules might do MVCL relatively better than the real iron.

MEMSETCL.txt using MVCL attached.

Peter

> -----Original Message-----
> From: [email protected] <[email protected]> On Behalf Of Peter Coghlan
> Sent: Tuesday, April 14, 2020 3:33 AM
> To: [email protected]
> Subject: Re: [h390-vm] memset help
>
> Why don't you suggest using an MVCL instruction?
>
> Regards,
> Peter Coghlan.

MEMSETCL.txt

Re: memset help

#970

The other thing to remember about mvcl is that it is interruptible.

René.

On 15 Apr 2020, at 00:07, Bob Polmanter <wably@...> wrote:

Adrian,

So is MVCL really that bad on real hardware?

Like so many things, it depended on how it was used. MVCL was probably used to clear storage or fill it with some character far more often than to actually move data from A to B. The clear or fill operation was the part that was slow, because it is considered an overlapping operation. But the MVC instruction also had a penalty if it was used in an overlapping operation, for example to clear a print line with blanks:

  MVI   LINE,C' '
  MVC   LINE+1(131),LINE

Most of this is because the 370 hardware back in the day was focused on doubleword memory accesses and where these doublewords fell in processor caches. Overlaps were painful. But that's a relative term. We're talking microseconds (back in the day). If an instruction took 3 microseconds to execute, a penalty case might be 15 microseconds. So basically, as long as you did not issue the penalty instruction in a loop, it did not matter all that much if it was issued only occasionally. I could clear the print line by moving 132 blanks already defined elsewhere to the print line in the 3 microsecond case. But, as with everything, there is a price. That meant I had to waste 132 bytes of storage holding blanks. In the early days this was an important consideration too.
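
Why the MVI/MVC pair above blanks the whole line can be modelled in C, assuming MVC behaves as a strict left-to-right, byte-at-a-time copy. The names mvc_model and clear_line are hypothetical; the loop is written out by hand because memcpy is undefined for overlapping operands and memmove would deliberately avoid the propagation.

```c
#include <assert.h>
#include <stddef.h>

/* Model of MVC: copy len bytes strictly left to right, one at a time. */
static void mvc_model(unsigned char *dst, const unsigned char *src,
                      size_t len)
{
    assert(len >= 1 && len <= 256);  /* one MVC moves 1 to 256 bytes */
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}

/* Blank a 132-byte print line with the overlapping-move idiom. */
static void clear_line(unsigned char line[132])
{
    line[0] = ' ';                   /* MVI   LINE,C' '         */
    mvc_model(line + 1, line, 131);  /* MVC   LINE+1(131),LINE  */
}
```

Each byte stored becomes the source of the next byte copied, so the single blank planted by MVI ripples through the remaining 131 positions.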

In Hercules, none of this matters and you shouldn't be focused on it. The execution and timing of instructions emulated by Hercules will not compare to real hardware from back in the day. An MVCL instruction and an MVC instruction for the same length moved probably cost about the same in Hercules. But if the MVCL length is more than 256, then the MVCL will almost certainly be faster than the multiple MVCs needed to do it. This is because of the cost of all the other things that Hercules has to do: instruction decode and set up, check the storage key and fetch and store status of both operands, move the data. MVC will have to do this for every instruction executed. MVCL will only have to do it one time initially, then the storage access part (only) once per 4K page boundary. The point here is that Hercules just does it differently than the hardware did and they can't be compared at all.
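
The instruction counts behind that argument are simple arithmetic. A rough sketch with hypothetical helper names, assuming only the 256-byte MVC limit and the 4K page size mentioned above; it counts instructions and pages and says nothing about actual timings:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MVC_MAX 256u   /* most bytes one MVC can move */
#define PAGE_4K 4096u  /* page size assumed above     */

/* Number of MVC instructions needed to move len bytes. */
static size_t mvc_count(size_t len)
{
    return (len + MVC_MAX - 1) / MVC_MAX;
}

/* Number of 4K pages touched by len bytes starting at addr. */
static size_t pages_touched(uint32_t addr, size_t len)
{
    assert(len > 0);
    size_t first = addr / PAGE_4K;
    size_t last  = (addr + len - 1) / PAGE_4K;
    return last - first + 1;
}
```

For a page-aligned 4096-byte move, mvc_count gives 16 decode-and-check cycles for the MVC approach, while a single MVCL is decoded once and touches one page per operand.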

In an emulated environment, I reiterate what Harold said: the fewer instructions the better. That's what matters most.

If you are really concerned about performance, you should be using Assembler. I realize that is not likely to happen. But you just can't have a fine enough control with C to influence the number or types of instructions issued in most cases. Either that or call a lot of Assembler subroutines.

Regards,
Bob