Discussion:
[Bug tree-optimization/61743] New: Complete unroll is not happened for loops with short upper bound
ysrumyan at gmail dot com
2014-07-08 08:50:53 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

Bug ID: 61743
Summary: Complete unroll is not happened for loops with short
upper bound
Product: gcc
Version: 4.10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ysrumyan at gmail dot com

We discovered significant performance regression on one important benchmark
from eembc2.0 suite after r211625. It turned out that complete unroll didn't
happen.
I prepared simple reproducer that exhibits the problem. If we compile it with
'-Dbtype=int' both innermost loops are unrolled completely by latest trunk
compiler:
test.c:20:5: note: loop with 8 iterations completely unrolled
test.c:11:5: note: loop with 8 iterations completely unrolled
but if '-Dbtype=e_u8' unroll does not happen.
Note also that for this particular test-case compiler built before r211625
performs curoll only for one loop.
ysrumyan at gmail dot com
2014-07-08 08:53:24 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #1 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Created attachment 33088
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33088&action=edit
test-case to reproduce

Use '-O3 -funroll-loops -Dbtype=[int,e_u8]' to reproduce.
rguenth at gcc dot gnu.org
2014-07-08 09:06:21 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

Richard Biener <rguenth at gcc dot gnu.org> changed:

What |Removed |Added
----------------------------------------------------------------------------
Keywords| |missed-optimization
Status|UNCONFIRMED |ASSIGNED
Last reconfirmed| |2014-07-08
Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
Target Milestone|--- |4.10.0
Summary|Complete unroll is not |[4.10 Regression] Complete
|happened for loops with |unroll is not happened for
|short upper bound |loops with short upper
| |bound
Ever confirmed|0 |1

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
I will have a look.
ysrumyan at gmail dot com
2014-08-07 08:01:30 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #3 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Any comments will be appreciated.
rguenth at gcc dot gnu.org
2014-08-07 10:02:51 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
We fail to determine an upper bound on the number of iterations for those
loops.
With 4.9 I only see one loop completely unrolled, that at t.c:20. 4.8 doesn't
unroll any loop either.

4.9 seems to conclude that by

Statement *pretmp_81[j_51] = _39;
is probably executed at most 7 (bounded by 7) + 1 times in loop 4.

which also is seen on trunk.

It would be interesting to know what makes us decide otherwise in the end
(can you bisect this?)
rguenth at gcc dot gnu.org
2014-08-07 10:06:52 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Err, sorry - you did bisect it already.
rguenth at gcc dot gnu.org
2014-08-07 11:15:59 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ok, so it's probably the range information on a SSA name we lose because we
restrict retaining that to the case where the new SSA name definition is in
the same basic-block as the original one:

/* If this now constitutes a copy duplicate points-to
and range info appropriately. This is especially
important for inserted code. See tree-ssa-copy.c
for similar code. */
if (sprime
&& TREE_CODE (sprime) == SSA_NAME)
{
basic_block sprime_b = gimple_bb (SSA_NAME_DEF_STMT (sprime));
...
else if (!POINTER_TYPE_P (TREE_TYPE (lhs))
&& SSA_NAME_RANGE_INFO (lhs)
&& !SSA_NAME_RANGE_INFO (sprime)
&& b == sprime_b)
duplicate_ssa_name_range_info (sprime,
SSA_NAME_RANGE_TYPE (lhs),
SSA_NAME_RANGE_INFO (lhs));

this is beause range info may not be valid if not dominated by a check
that constrains it.

In this case this is overly conservative. Removing the b == sprime_b test
fixes the testcase back to unrolling t.c:20 like 4.9 did.

Note that I don't think we can remove the check. Before r211625 we
simply retained an SSA name copy at the original place which is now
copy-propagated out:

# RANGE [4, 8] NONZERO 0x0000000000000000e
_43 = prephitmp_45;

OTOH for 4.9 later copyprop also removes _43 and has no range-info on _45.
But it retains a loop-closed SSA PHI node:

<bb 23>:
# RANGE [4, 8] NONZERO 0x0000000000000000e
# _76 = PHI <_43(10)>
goto <bb 14>;

which is only there because of that initial copy and earlier CSE I guess.
So the 2nd loop does see range info.

So in the end the events are unfortunate. The only way out would be to
make PRE "translate" range-info and set it on PHIs it inserts:

Found partial partial redundancy for expression {nop_expr,n_1} (0018)
Created phi prephitmp_57 = PHI <4(3), 6(17), 8(4)>
in block 5 (0018)

# RANGE [4, 8] NONZERO 14
# n_1 = PHI <4(3), 6(17), 8(4)>
# prephitmp_57 = PHI <4(3), 6(17), 8(4)>
<L11>:

note how _57 misses range-info (but trivially has the same as n_1).

A hack like

Index: gcc/tree-ssa-pre.c
===================================================================
--- gcc/tree-ssa-pre.c (revision 213700)
+++ gcc/tree-ssa-pre.c (working copy)
@@ -3218,6 +3218,28 @@ insert_into_preds_of_block (basic_block
bitmap_insert_into_set (NEW_SETS (block),
newphi);

+ if (expr->kind == NARY
+ && CONVERT_EXPR_CODE_P (expr->u.nary->opcode)
+ && TREE_CODE (expr->u.nary->op[0]) == SSA_NAME
+ && gimple_bb (SSA_NAME_DEF_STMT (expr->u.nary->op[0])) == block
+ && INTEGRAL_TYPE_P (type)
+ && INTEGRAL_TYPE_P (TREE_TYPE (expr->u.nary->op[0]))
+ && (TYPE_PRECISION (type)
+ >= TYPE_PRECISION (TREE_TYPE (expr->u.nary->op[0])))
+ && SSA_NAME_RANGE_INFO (expr->u.nary->op[0]))
+ {
+ wide_int min, max;
+ if (get_range_info (expr->u.nary->op[0], &min, &max) == VR_RANGE
+ && !wi::neg_p (min, SIGNED)
+ && !wi::neg_p (max, SIGNED))
+ set_range_info (temp,
+ SSA_NAME_RANGE_TYPE (expr->u.nary->op[0]),
+ wide_int_storage::from (min, TYPE_PRECISION (type),
+ TYPE_SIGN (type)),
+ wide_int_storage::from (max, TYPE_PRECISION (type),
+ TYPE_SIGN (type)));
+ }
+

handles this very special case (a widening/sign-changing conversion that
doesn't change the range with an original definition in the same BB as
the PHI is placed in - which means the original definition has to be a
PHI node as well).

Get's us back to unrolling t.c.:20
rguenth at gcc dot gnu.org
2014-08-07 11:21:19 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Or we could realize that range-info is important for loop passes and move
the first VRP pass accordingly (after pass_sink_code).
ysrumyan at gmail dot com
2014-08-11 15:02:16 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #8 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Richard,

I tested both proposed fixes and i turned out that the first one is preferable
since performance of benchmark came back. Note that hoisting 2nd vrp pass gave
us another 14% slowdown but I did not investigate it.

Should I tested your fix or you do it yourself?

Thanks.
rguenth at gcc dot gnu.org
2014-08-13 08:35:26 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Yuri Rumyantsev from comment #8)
Post by ysrumyan at gmail dot com
Richard,
I tested both proposed fixes and i turned out that the first one is
preferable since performance of benchmark came back. Note that hoisting 2nd
vrp pass gave us another 14% slowdown but I did not investigate it.
Should I tested your fix or you do it yourself?
I'd like to explore on how to make it a little more "generic", it's a very
special hack as-is.
Post by ysrumyan at gmail dot com
Thanks.
ysrumyan at gmail dot com
2014-09-08 10:58:14 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #10 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Richard,

Do you have any progress?

Thanks.

2014-08-13 12:35 GMT+04:00 rguenth at gcc dot gnu.org
Post by ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Yuri Rumyantsev from comment #8)
Post by ysrumyan at gmail dot com
Richard,
I tested both proposed fixes and i turned out that the first one is
preferable since performance of benchmark came back. Note that hoisting 2nd
vrp pass gave us another 14% slowdown but I did not investigate it.
Should I tested your fix or you do it yourself?
I'd like to explore on how to make it a little more "generic", it's a very
special hack as-is.
Post by ysrumyan at gmail dot com
Thanks.
--
You reported the bug.
rguenther at suse dot de
2014-09-08 11:13:04 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #11 from rguenther at suse dot de <rguenther at suse dot de> ---
Post by ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743
--- Comment #10 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Richard,
Do you have any progress?
No, I didn't yet have time to get back to this.
ysrumyan at gmail dot com
2014-10-24 08:26:09 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #12 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Richard,

Did you have a chance to look at this and prepare more general fix?

Thanks.
Yuri.
Post by ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743
--- Comment #11 from rguenther at suse dot de <rguenther at suse dot de> ---
Post by ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743
--- Comment #10 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Richard,
Do you have any progress?
No, I didn't yet have time to get back to this.
--
You reported the bug.
rguenther at suse dot de
2014-10-24 08:37:33 UTC
Permalink
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743

--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
Post by ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61743
--- Comment #12 from Yuri Rumyantsev <ysrumyan at gmail dot com> ---
Richard,
Did you have a chance to look at this and prepare more general fix?
No, I'm swamped with other work. I plan to look into this during
stage3.

Loading...