
memset help


 

Folks

Currently GCCLIB has memset() as

void *memset(void *s, int c, size_t n)
{
    size_t x = 0;

    for (x = 0; x < n; x++)
    {
        *((char *)s + x) = (unsigned char)c;
    }
    return (s);
}

Slow ...

Is anyone willing to do an optimised S/370 assembler version?

If people want a competition, I am happy to benchmark: needs to win both for small memory areas and for large memory areas :-)

Thanks!

Adrian
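
A minimal sketch of how the small-area versus large-area comparison might be timed, assuming the standard clock() is usable in the target C library. The names memset_fn, memset_bytewise and time_fill are hypothetical and not part of GCCLIB; the byte loop is the one quoted above, renamed so it does not clash with the library memset.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical type for a memset candidate under test. */
typedef void *(*memset_fn)(void *s, int c, size_t n);

/* The byte-at-a-time loop from the post, under a different name. */
static void *memset_bytewise(void *s, int c, size_t n)
{
    size_t x;

    for (x = 0; x < n; x++)
    {
        *((char *)s + x) = (unsigned char)c;
    }
    return s;
}

/* Run `reps` fills of `n` bytes and return the elapsed CPU time in
 * seconds, as far as the resolution of clock() allows. The fill byte
 * changes on every pass so the work cannot trivially be skipped. */
static double time_fill(memset_fn fn, void *buf, size_t n, unsigned long reps)
{
    unsigned long i;
    clock_t start = clock();

    for (i = 0; i < reps; i++)
    {
        fn(buf, (int)(i & 0xFF), n);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_fill() once with a small n and many repetitions, and once with a large n and few repetitions, gives the two figures a candidate would have to win on. The repetition counts are a judgment call and would need tuning to the clock resolution available.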


 


Adrian

I am grumpy because I am trying to replace a web front end that only works in Internet Explorer with compatibility mode on with some ASP.NET that works on any platform…

Anyway, the code below is a tad slow but it works on any platform. There are lots of other things that don’t work; isn’t fiddling with this a distraction? I thought the current effort was to move stuff from assembler to “C”, not the other way…

Dave

From: [email protected] <[email protected]> On Behalf Of adriansutherland67
Sent: 13 April 2020 18:33
To: [email protected]
Subject: [h390-vm] memset help

<Snipped>


 

In 2020 we are all allowed to be grumpy!

It's not my biggest concern (bytewise I/O is that), but it is an example where an S/370 assembler guru could spend 10 minutes and smash it. Even in C I would try to do it dword by dword ...

For any application we port that uses calloc() it would make a difference.

But as you say if a magic answer doesn't appear soon, we have a functional bit of code.

We are making progress: atexit, signals (which sounds like it will have to be internal to GCCLIB), longjmp, and we have cleared the remaining non-RESLIB detritus. Dynamic stack, SVC 202 workarounds, I/O optimisations and we have a baseline. Then housekeeping like automatic RESLIB stub generation and, MOST important, automatic testing.

(Ok a bit to do!)


 


As an academic exercise, two assembler alternatives attached as text files.

MEMSET.txt uses STC/MVC to set 256 bytes at a time and then STC and variable-length MVC to set the remainder less than 256 bytes, with optimizations not to set anything unneeded if N is zero or an integer multiple of 256.

MEMSET16.txt uses 2 more registers than MEMSET.txt but replaces the length=255 MVC instruction with multiple STore operations (loop 16 times storing 4 bytes at a time 4 times for each loop around). MVC was notoriously slow on some real-iron IBM hardware models.

Not sure which technique would have been faster on any real 360-era iron, but there could be differences in the Hercules MVC vs STore operations that may make the STore solution faster (or not).

These are untested, so I could have errors in my coding. Corrections and improvements welcome.

A C solution doing the same sort of thing as MEMSET16.txt using casts to INT and taking advantage of the C compiler’s optimization and code generation could be even faster. Something along these lines doing 16 bytes at a time (could obviously do 32 or 64 each loop as well, but 16 gives you the picture):

void *memset(void *s, int c, size_t n)
{
    size_t x;
    /* Use only the low byte of c, replicated into all four bytes of a word */
    unsigned int b = (unsigned char)c;
    unsigned int cccc = b | (b << 8) | (b << 16) | (b << 24);

    /* Whole 16-byte blocks; assumes 32-bit int and that unaligned word stores are acceptable */
    for (x = 0; n - x >= 16; x += 16)
    {
        *((unsigned int *)((char *)s + x     )) = cccc;
        *((unsigned int *)((char *)s + x +  4)) = cccc;
        *((unsigned int *)((char *)s + x +  8)) = cccc;
        *((unsigned int *)((char *)s + x + 12)) = cccc;
    }
    /* Remaining 0 to 15 bytes */
    for (; x < n; x++)
    {
        *((char *)s + x) = (unsigned char)c;
    }
    return (s);
}

HTH

Peter
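
Since the attached routines are described as untested, a small checker along the following lines might help before any benchmarking. This is only a sketch: check_memset, memset_fn and memset_skips_tail are hypothetical names, the size list is an assumption chosen to straddle the 16-byte and 256-byte boundaries mentioned above, and an assembler candidate would need a C-callable entry point to be passed in.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type for a memset candidate under test. */
typedef void *(*memset_fn)(void *s, int c, size_t n);

/* Deliberately wrong candidate: fills only whole groups of 4 bytes
 * and forgets the tail. Included to show the checker rejecting it. */
static void *memset_skips_tail(void *s, int c, size_t n)
{
    size_t x;

    for (x = 0; x < (n / 4) * 4; x++)
    {
        *((char *)s + x) = (unsigned char)c;
    }
    return s;
}

/* Returns 1 if fn fills exactly n bytes for every tested size and
 * start offset, returns s, and leaves the guard bytes alone; else 0. */
static int check_memset(memset_fn fn)
{
    static const size_t sizes[] =
        {0, 1, 3, 4, 15, 16, 17, 255, 256, 257, 512, 1000};
    unsigned char buf[1100];
    size_t i, off, k;

    for (i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
    {
        for (off = 0; off < 8; off++)        /* exercise unaligned starts */
        {
            size_t n = sizes[i];
            unsigned char *start = buf + 16 + off;

            memset(buf, 0xEE, sizeof buf);   /* guard pattern, library memset */
            if (fn(start, 0x5A, n) != (void *)start)
            {
                return 0;                    /* must return s */
            }
            for (k = 0; k < sizeof buf; k++)
            {
                int inside = (k >= 16 + off && k < 16 + off + n);

                if (buf[k] != (inside ? 0x5A : 0xEE))
                {
                    return 0;                /* wrong byte or guard overwritten */
                }
            }
        }
    }
    return 1;
}
```

A candidate that passes here has at least handled N equal to zero, exact multiples of 16 and 256, and the sizes on either side of them; it says nothing about speed.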



 

Thanks Peter ... I will light up both and provide relative performance stats.

Also (and this shows how much stuff is just there in the public domain) have a look at this.



Best (fastest) solution by the weekend gets the glory ...



 


> As an academic exercise, two assembler alternatives attached as text files.
>
> <Snipped>

Why don't you suggest using an MVCL instruction?

Regards,
Peter Coghlan.


 

On Tue, 2020-04-14 at 08:32 +0100, Peter Coghlan wrote:

> Why don't you suggest using an MVCL instruction?

Regardless of the various performance characteristics that may have
existed on real 370-era hardware, VM/370 runs today on Hercules. So
what makes sense? The fewer instructions the better. I would
recommend use of the MVCL instruction as well.

Remember that when using C on Hercules one compiles the C to machine
instructions that are in turn interpreted by another C program,
Hercules.

This is not to say that any part of GCC should or should not
incorporate assembler, nor am I saying that performance should not be
tested.

I am saying that, in general, the fewer instructions the better. And we
all know that this is particularly true for loops.

MVCL allows the loops to be embedded within the interpreter,
eliminating them as part of the GCC implementation.

Harold Grovesteen
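
To make the MVCL suggestion concrete, here is a rough C model of what the instruction does when it is used as a fill. It is not the instruction itself and not the contents of any attached file: mvcl_model and memset_via_mvcl_model are hypothetical names. The model assumes the architected behaviour that MVCL copies the shorter of the two operand lengths and then pads the rest of the destination with the pad byte, so a zero-length source turns the whole operation into a fill. On S/370 the length fields are 24 bits wide, so a single MVCL covers up to 16,777,215 bytes.

```c
#include <stddef.h>

/*
 * Software model of MVCL semantics:
 * copy min(dst_len, src_len) bytes, then pad the rest of dst with `pad`.
 * Returns the condition code: 0 = lengths equal, 1 = destination shorter,
 * 2 = destination longer. Destructive-overlap detection (condition
 * code 3) is not modelled here.
 */
static int mvcl_model(unsigned char *dst, size_t dst_len,
                      const unsigned char *src, size_t src_len,
                      unsigned char pad)
{
    size_t i;
    size_t ncopy = dst_len < src_len ? dst_len : src_len;

    for (i = 0; i < ncopy; i++)
    {
        dst[i] = src[i];
    }
    for (; i < dst_len; i++)
    {
        dst[i] = pad;
    }
    return dst_len == src_len ? 0 : (dst_len < src_len ? 1 : 2);
}

/* memset expressed as one MVCL-style operation: the source length is
 * zero, so every destination byte comes from the pad byte. */
static void *memset_via_mvcl_model(void *s, int c, size_t n)
{
    mvcl_model((unsigned char *)s, n, (const unsigned char *)s, 0,
               (unsigned char)c);
    return s;
}
```

In the real routine both loops above would disappear into the single instruction, which is the point being made: under an emulator the loop then runs inside the emulator's own code rather than as emulated instructions.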


 

On Tue, Apr 14, 2020 at 04:39 PM, Harold Grovesteen wrote:
VM/370 runs today on Hercules.
It is true that the performance will be measured on Hercules.

So is MVCL really that bad on real hardware?

A


 

Adrian,
So is MVCL really that bad on real hardware?
Like so many things, it depended on how it was used. MVCL was probably used far more often to clear storage or fill it with some character than to actually move data from A to B. The clear or fill operation was the part that was slow, because it is considered an overlapping operation. But the MVC instruction also had a penalty if it was used in an overlapping operation, for example to clear a print line with blanks:

  MVI   LINE,C' '
  MVC   LINE+1(131),LINE

Most of this is because the 370 hardware back in the day was focused on doubleword memory accesses and where these doublewords fell in processor caches. Overlaps were painful. But that's a relative term.
We're talking microseconds (back in the day). If an instruction took 3 microseconds to execute, a penalty case might be 15 microseconds. So basically, as long as you did not issue the penalty instruction in a loop, it did not matter all that much if it was issued only occasionally. I could clear the print line by moving 132 blanks already defined elsewhere to the print line in the 3 microsecond case. But, as with everything, there is a price. That meant I had to waste 132 bytes of storage holding blanks. In the early days this was an important consideration too.

In Hercules, none of this matters and you shouldn't be focused on it. The execution and timing of instructions emulated by Hercules will not compare to real hardware from back in the day. An MVCL instruction and an MVC instruction for the same length moved probably cost about the same in Hercules. But if the MVCL length is more than 256, then the MVCL will almost certainly be faster than the multiple MVCs needed to do it. This is because of the cost of all the other things that Hercules has to do: instruction decode and set up, check the storage key and fetch and store status of both operands, move the data. MVC will have to do this for every instruction executed. MVCL will only have to do it one time initially, then the storage access part (only) once per 4K page boundary. The point here is that Hercules just does it differently than the hardware did and they can't be compared at all.

In an emulated environment, I reiterate what Harold said: the fewer instructions the better. That's what matters most.

If you are really concerned about performance, you should be using Assembler. I realize that is not likely to happen. But you just can't have fine enough control with C to influence the number or types of instructions issued in most cases. Either that or call a lot of Assembler subroutines.

Regards,
Bob
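
For readers less used to the MVI/MVC idiom above, a small C model may help show why the overlapping move works as a fill. It assumes the architected MVC behaviour of moving bytes one at a time from left to right, so each byte stored is picked up again as the next source byte. The names mvc_model and clear_print_line are hypothetical. Note that memcpy() is not a safe substitute (overlap is undefined) and memmove() would preserve the old contents instead of rippling the blank, which is why the loop is written out.

```c
#include <stddef.h>

/* Model of MVC: bytes are moved one at a time, left to right, so an
 * overlapping destination sees bytes already stored by this same move.
 * A real MVC moves at most 256 bytes; the model does not enforce that. */
static void mvc_model(unsigned char *dst, const unsigned char *src, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
    {
        dst[i] = src[i];
    }
}

/* C rendering of:   MVI  LINE,C' '
 *                   MVC  LINE+1(131),LINE                              */
static void clear_print_line(unsigned char line[132])
{
    line[0] = ' ';                    /* MVI: store a single blank      */
    mvc_model(line + 1, line, 131);   /* MVC: ripple it across the rest */
}
```

Each byte of the line is written from the byte just before it, which is exactly the dependency that made this form expensive on hardware that preferred to move whole doublewords.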


 

The other thing to remember about MVCL is that it is interruptible.

René.

On 15 Apr 2020, at 00:07, Bob Polmanter <wably@...> wrote:

<Snipped>


 

Because I did not remember that MVCL was available at the 370 architecture level (and failed to go look it up), and because MVCL has mostly been quite slow at the real-iron hardware level.

Of course, Hercules might do MVCL relatively better than the real iron.

MEMSETCL.txt using MVCL attached.

Peter

> -----Original Message-----
> From: [email protected] <[email protected]> On Behalf Of Peter Coghlan
> Sent: Tuesday, April 14, 2020 3:33 AM
> To: [email protected]
> Subject: Re: [h390-vm] memset help
>
> <Snipped>
>
> Why don't you suggest using an MVCL instruction?
>
> Regards,
> Peter Coghlan.


 

Peter,

I tend not to worry about performance, but anyway I believe IBM's current advice is not to try, and instruction timings disappeared from the manuals yonks ago…

For example, early 9370s were especially bad on non-aligned instructions, but modern hardware doesn’t care a jot. I suppose it might be nice to try it on the P/390, but I am still trying to rewrite a web site…

Dave

From: [email protected] <[email protected]> On Behalf Of pjfarley3
Sent: 14 April 2020 23:24
To: [email protected]
Subject: Re: [h390-vm] memset help

<Snipped>


 

Dave,

I’m acutely aware of that IBM advice, but in the last two decades I have also been involved in multiple rounds of “MIPS-saving” projects when management wanted application teams to “do more with less” (i.e. increase performance / reduce batch execution times without buying more/bigger hardware).

The most effective solutions in those projects were finding the “CPU hot spots” (the Strobe product was always particularly effective for such efforts), and more times than not the worst “hot spots” turned out to be MVCL and sometimes CLCL instructions in compiler-generated COBOL code, and the second worst “hot spots” were the COBOL INITIALIZE verb for large heterogeneous structures used inside of a loop, or at every invocation of a subroutine.

Finding ways to hoist long-length moves/compares and INITIALIZE verbs out of business processing loops usually yielded the best/largest reductions of CPU and elapsed time used. Second-best solutions were complicated and usually application-specific adjustments to business processing rules (along the lines of “the fastest I/O or business process is the one not done”).

But I digress from the subject at hand. You and Harold are right here: for Hercules, fewer instructions will yield better performance, so if replacing the C version of memset would dramatically improve performance for C programs, then the MVCL solution will probably work best under Hercules.

Peter


From: [email protected] <[email protected]> On Behalf Of Dave Wade
Sent: Tuesday, April 14, 2020 7:17 PM
To: [email protected]
Subject: Re: [h390-vm] memset help

<Snipped>


 


> The most effective solutions in those projects were finding the "CPU hot
> spots" (the Strobe product was always particularly effective for such
> efforts), and more times than not the worst "hot spots" turned out to be
> MVCL and sometimes CLCL instructions in compiler-generated COBOL code, and
> the second worst "hot spots" were the COBOL INITIALIZE verb for large
> heterogeneous structures used inside of a loop, or at every invocation of a
> subroutine.

I wonder could this have been the COBOL compiler abusing MVCL instructions
in situations where they were not the appropriate instructions to use?

Perhaps instructions such as MVCL would be expected to be "hot spots" because
they can deliver a relatively large amount of work for a single instruction?
Or is it that implementations of this instruction were sometimes poorer than
they ought to be and they were really not delivering bang for buck?

Regards,
Peter Coghlan.


 

All interesting ... and I will try out each candidate and report back. It will be tested only on Hercules, so in one sense it is not a fair test. On the other hand we could argue that Hercules is S/370 done well ... meaning I agree with the comment that if a CPU manufacturer defines a bulk move command they should implement it fast!

One thing: modern compilers generally produce faster code than a human assembler programmer. This is because of both front-end optimisations (e.g. not reassigning or recalculating values that have not changed) and back-end optimisations such as loop unrolling and instruction reordering for CPU pipelines.

This is why LLVM is becoming the one toolchain to rule them all ... even IBM works with them to ensure mainframe CPU internals are fully leveraged.

A


 

Compilers have produced faster code for a long time already, because human coders kept choosing storage-to-storage instructions where the register-based versions were faster.

But in cases like TR and TRT the assembler programmer always wins.

René.

On 15 Apr 2020, at 09:02, adriansutherland67 <adrian@...> wrote:

<Snipped>


 

Peter,

When I first started work, I worked in a small UK insurance company. We had a Honeywell H3200, which basically ran IBM 1401 code, but with “improved I/O”…

We had one small program that was very slow. It was COBOL. We replaced lots of conditional PERFORMs with “ALTER” statements. It ran about 100 times faster but was totally understandable.

I doubt any change on Hercules will yield such a performance increase. As Michael Jackson said:

“I am on a world tour. My tour is in pursuit of exceptional beer. That's why they call me the Beer Hunter.”

That’s a pursuit I try and emulate to the best of my ability…

… that was, by the way, Michael the Beer Hunter.

Another Michael Jackson said, in “Principles of Program Design”, that there are two rules for optimization:

  1. Don’t do it
  2. Don’t do it yet.

I think we are still at Rule 1. You can’t optimize something that doesn’t work…

… for you to run “Strobe” it had to work…

Dave

p.s. I believe that some folks think Michael Jackson the singer and dancer was most famous. Possibly this is true, but I still prefer the works of the two above.


From: [email protected] <[email protected]> On Behalf Of pjfarley3
Sent: 15 April 2020 07:12
To: [email protected]
Subject: Re: [h390-vm] memset help

<Snipped>


 

On Wed, Apr 15, 2020 at 07:11 AM, rvjansen@... wrote:
TR and TRT
?

Anyway I am biased ... as it takes me about 5 mins to write a single line of S/370 assembler not counting debugging!


 

On Wed, Apr 15, 2020 at 07:15 AM, Dave Wade wrote:
I doubt any change on Hercules will yield such a performance increase
Now that sounds like a challenge!

Let's see what we get with GCCLIB IO improvements. I will make sure my test lines are at least 100 characters long :-)


 

Translate - like the REXX function, but in assembler.

Anecdote: I once got a performance problem on my desk. It did a character translation, in C. It looped through a string and replaced it character by character.
I noticed it had one flaw: it did not stop when it found the right character, but went on to 255, for every character. (True story!)

The next day I replaced it with an assembler version with two tables and a translate instruction. The application flew.
(This was on a 486, with XLATB, but it is the same thing. These things are hard for a compiler to figure out.)
(This is also why we need drivers for e.g. SSL to use cryptographic assist/acceleration instructions. The compiler won’t do that for you; that would be a bit like Clippy telling you “ah, I see you are using a bubblesort, let me replace that by a blockset search".)

René.
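
The table-driven idea behind TR can be sketched in portable C as follows. This is only an illustration of the technique, not the code from the anecdote: translate and build_table are hypothetical names, and a single 256-entry table is assumed, indexed by the byte value being replaced.

```c
#include <stddef.h>

/* TR-style translate: replace each byte with table[byte].
 * One table lookup per character, no searching. */
static void translate(unsigned char *buf, size_t len,
                      const unsigned char table[256])
{
    size_t i;

    for (i = 0; i < len; i++)
    {
        buf[i] = table[buf[i]];
    }
}

/* Hypothetical helper: start from the identity mapping, then map
 * each from[i] to to[i]. Built once, reused for every string. */
static void build_table(unsigned char table[256],
                        const unsigned char *from,
                        const unsigned char *to, size_t n)
{
    size_t i;

    for (i = 0; i < 256; i++)
    {
        table[i] = (unsigned char)i;
    }
    for (i = 0; i < n; i++)
    {
        table[from[i]] = to[i];
    }
}
```

The slow version in the anecdote searched the replacement list for every character; here the search cost is paid once when the table is built, which is the same trade the TR instruction makes in hardware.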

On 15 Apr 2020, at 10:02, adriansutherland67 <adrian@...> wrote:

<Snipped>