basic_optional #1

Open · wants to merge 1 commit into base: basic_optional

Conversation

NinaRanns (Owner)

basic_optional proof of concept implementation and tests

NinaRanns pushed a commit that referenced this pull request Aug 30, 2023
Here ever since r12-6022-gbb2a7f80a98de3 we stopped deeming the partial
specialization #2 to be more specialized than #1 ultimately because
dependent operator expressions now have a DEPENDENT_OPERATOR_TYPE type
instead of an empty type, and this made unify stop deducing T(2) == 1
for K during partial ordering for #1 and #2.

This minimal patch fixes this by making the relevant logic in unify
treat DEPENDENT_OPERATOR_TYPE like an empty type.
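
A hypothetical reduced shape of the regression (illustrative only, not
the actual partial-specialization13.C testcase):

  template<class T, bool K> struct A { };
  template<class T, bool K> struct A<T*, K> { };   // #1
  template<class T> struct A<T*, T(2) == 1> { };   // #2
  A<int*, false> a;  // wants #2, the more specialized candidate

Partial ordering here must deduce K == (T(2) == 1) when checking #2
against #1, which is the deduction that regressed.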

	PR c++/105425

gcc/cp/ChangeLog:

	* pt.cc (unify) <case TEMPLATE_PARM_INDEX>: Treat
	DEPENDENT_OPERATOR_TYPE like an empty type.

gcc/testsuite/ChangeLog:

	* g++.dg/template/partial-specialization13.C: New test.

(cherry picked from commit 509fd16)
NinaRanns pushed a commit that referenced this pull request Aug 30, 2023
(In reply to Uroš Bizjak from comment #1)
> Instruction does not accept memory operand for operand 3:
>
> (define_insn_and_split
> "*<sse4_1>_blendv<ssefltmodesuffix><avxsizesuffix>_ltint"
>   [(set (match_operand:<ssebytemode> 0 "register_operand" "=Yr,*x,x")
> 	(unspec:<ssebytemode>
> 	  [(match_operand:<ssebytemode> 1 "register_operand" "0,0,x")
> 	   (match_operand:<ssebytemode> 2 "vector_operand" "YrBm,*xBm,xm")
> 	   (subreg:<ssebytemode>
> 	     (lt:VI48_AVX
> 	       (match_operand:VI48_AVX 3 "register_operand" "Yz,Yz,x")
> 	       (match_operand:VI48_AVX 4 "const0_operand")) 0)]
> 	  UNSPEC_BLENDV))]
>
> The problematic insn is:
>
> (define_insn_and_split "*avx_cmp<mode>3_ltint_not"
>  [(set (match_operand:VI48_AVX  0 "register_operand")
>        (vec_merge:VI48_AVX
> 	 (match_operand:VI48_AVX 1 "vector_operand")
> 	 (match_operand:VI48_AVX 2 "vector_operand")
> 	 (unspec:<avx512fmaskmode>
> 	   [(subreg:VI48_AVX
> 	    (not:<ssebytemode>
> 	      (match_operand:<ssebytemode> 3 "vector_operand")) 0)
> 	    (match_operand:VI48_AVX 4 "const0_operand")
> 	    (match_operand:SI 5 "const_0_to_7_operand")]
> 	    UNSPEC_PCMP)))]
>
> which gets split to the above pattern.
>
> In the preparation statements we have:
>
>   if (!MEM_P (operands[3]))
>     operands[3] = force_reg (<ssebytemode>mode, operands[3]);
>   operands[3] = lowpart_subreg (<MODE>mode, operands[3], <ssebytemode>mode);
>
> Which won't fly when operand 3 is memory operand...
>
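
Per the ChangeLog below, the fix is presumably to drop the MEM_P check
and unconditionally force operand 3 into a register before taking the
lowpart subreg, along the lines of:

  /* Sketch: always load operand 3 into a register first.  */
  operands[3] = force_reg (<ssebytemode>mode, operands[3]);
  operands[3] = lowpart_subreg (<MODE>mode, operands[3], <ssebytemode>mode);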

gcc/ChangeLog:

	PR target/105953
	* config/i386/sse.md (*avx_cmp<mode>3_ltint_not): Force_reg
	operands[3].

gcc/testsuite/ChangeLog:

	* g++.target/i386/pr105953.C: New test.
NinaRanns pushed a commit that referenced this pull request Aug 30, 2023
While looking at PR 105549, which is about fixing the ABI break
introduced in GCC 9.1 in parameter alignment with bit-fields, we
noticed that the GCC 9.1 warning is not emitted in all the cases where
it should be.  This patch fixes that and the next patch in the series
fixes the GCC 9.1 break.

We split this into two patches since patch #2 introduces a new ABI
break starting with GCC 13.1.  This way, patch #1 can be back-ported
to release branches if needed to fix the GCC 9.1 warning issue.

The main idea is to add a new global boolean that indicates whether
we're expanding the start of a function, so that aarch64_layout_arg
can emit warnings for callees as well as callers.  This removes the
need for aarch64_function_arg_boundary to warn (with its incomplete
information).  However, in the first patch there are still cases where
we emit warnings where we should not; this is fixed in patch #2 where
we can distinguish between the GCC 9.1 and GCC 13.1 ABI breaks properly.

The fix in aarch64_function_arg_boundary (replacing & with &&) looks
like an oversight of a previous commit in this area which changed
'abi_break' from a boolean to an integer.

We also take the opportunity to fix the comment above
aarch64_function_arg_alignment since the value of the abi_break
parameter was changed in a previous commit, no longer matching the
description.
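
A minimal sketch of the new flag, assuming the shape implied by the
ChangeLog below (the exact placement inside expand_function_start may
differ):

  /* function.h */
  extern bool currently_expanding_function_start;

  /* function.cc */
  bool currently_expanding_function_start;

  void
  expand_function_start (tree subr)
  {
    currently_expanding_function_start = true;
    /* ... existing code that assigns the incoming arguments, during
       which aarch64_layout_arg can now warn for the callee ... */
    currently_expanding_function_start = false;
  }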

2022-11-28  Christophe Lyon  <christophe.lyon@arm.com>
	    Richard Sandiford  <richard.sandiford@arm.com>

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_function_arg_alignment): Fix
	comment.
	(aarch64_layout_arg): Factorize warning conditions.
	(aarch64_function_arg_boundary): Fix typo.
	* function.cc (currently_expanding_function_start): New variable.
	(expand_function_start): Handle
	currently_expanding_function_start.
	* function.h (currently_expanding_function_start): Declare.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/bitfield-abi-warning-align16-O2.c: New test.
	* gcc.target/aarch64/bitfield-abi-warning-align16-O2-extra.c: New
	test.
	* gcc.target/aarch64/bitfield-abi-warning-align32-O2.c: New test.
	* gcc.target/aarch64/bitfield-abi-warning-align32-O2-extra.c: New
	test.
	* gcc.target/aarch64/bitfield-abi-warning-align8-O2.c: New test.
	* gcc.target/aarch64/bitfield-abi-warning.h: New test.
	* g++.target/aarch64/bitfield-abi-warning-align16-O2.C: New test.
	* g++.target/aarch64/bitfield-abi-warning-align16-O2-extra.C: New
	test.
	* g++.target/aarch64/bitfield-abi-warning-align32-O2.C: New test.
	* g++.target/aarch64/bitfield-abi-warning-align32-O2-extra.C: New
	test.
	* g++.target/aarch64/bitfield-abi-warning-align8-O2.C: New test.
	* g++.target/aarch64/bitfield-abi-warning.h: New test.

(cherry picked from commit 3df1a11)
NinaRanns pushed a commit that referenced this pull request Mar 5, 2024
Here the ahead-of-time overload set pruning in finish_call_expr is
unintentionally returning a CALL_EXPR whose (pruned) callee is wrapped
in an ADDR_EXPR, despite the original callee not being wrapped in an
ADDR_EXPR.  This ends up causing a bogus declaration mismatch error in
the below testcase because the call to min in #1 gets expressed as a
CALL_EXPR of ADDR_EXPR of FUNCTION_DECL, whereas the level-lowered call
to min in #2 gets expressed instead as a CALL_EXPR of FUNCTION_DECL.

This patch fixes this by stripping the spurious ADDR_EXPR appropriately.
Thus the first call to min now also gets expressed as a CALL_EXPR of
FUNCTION_DECL, matching the behavior before r12-6075-g2decd2cabe5a4f.
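
A sketch of the kind of stripping involved (the actual change is in
finish_call_expr; the exact guards may differ):

  /* If overload set pruning wrapped the selected callee in a spurious
     ADDR_EXPR, strip it so the call keeps its original shape.  */
  if (TREE_CODE (fn) == ADDR_EXPR
      && TREE_CODE (TREE_OPERAND (fn, 0)) == FUNCTION_DECL)
    fn = TREE_OPERAND (fn, 0);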

	PR c++/107461

gcc/cp/ChangeLog:

	* semantics.cc (finish_call_expr): Strip ADDR_EXPR from
	the selected callee during overload set pruning.

gcc/testsuite/ChangeLog:

	* g++.dg/template/call9.C: New test.

(cherry picked from commit 59e0376)
NinaRanns pushed a commit that referenced this pull request Mar 5, 2024
After r13-5684-g59e0376f607805 the (pruned) callee of a non-dependent
CALL_EXPR is a bare FUNCTION_DECL rather than ADDR_EXPR of FUNCTION_DECL.
This innocent change revealed that cp_tree_equal doesn't first check
dependence of a CALL_EXPR before treating a FUNCTION_DECL callee as a
dependent name, which leads to us incorrectly accepting the first two
testcases below and rejecting the third:

 * In the first testcase, cp_tree_equal incorrectly returns true for
   the two non-dependent CALL_EXPRs f(0) and f(0) (whose CALL_EXPR_FN
   are different FUNCTION_DECLs) which causes us to treat #2 as a
   redeclaration of #1.

 * Same issue in the second testcase, for f<int*>() and f<char>().

 * In the third testcase, cp_tree_equal incorrectly returns true for
   f<int>() and f<void(*)(int)>() which causes us to conflate the two
   dependent specializations A<decltype(f<int>()(U()))> and
   A<decltype(f<void(*)(int)>()(U()))>.

This patch fixes this by making called_fns_equal treat two callees as
dependent names only if the overall CALL_EXPRs are dependent, via a new
convenience function call_expr_dependent_name that is like dependent_name
but also checks dependence of the overall CALL_EXPR.
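
A sketch of the new helper, assuming dependent CALL_EXPRs are
recognizable by their lack of a type:

  /* Like dependent_name, but takes the whole CALL_EXPR: only treat the
     callee as a dependent name if the call itself is dependent.  */
  tree
  call_expr_dependent_name (tree call)
  {
    if (TREE_TYPE (call) != NULL_TREE)
      return NULL_TREE;
    return dependent_name (CALL_EXPR_FN (call));
  }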

	PR c++/107461

gcc/cp/ChangeLog:

	* cp-tree.h (call_expr_dependent_name): Declare.
	* pt.cc (iterative_hash_template_arg) <case CALL_EXPR>: Use
	call_expr_dependent_name instead of dependent_name.
	* tree.cc (call_expr_dependent_name): Define.
	(called_fns_equal): Adjust to take two CALL_EXPRs instead of
	CALL_EXPR_FNs thereof.  Use call_expr_dependent_name instead
	of dependent_name.
	(cp_tree_equal) <case CALL_EXPR>: Adjust call to called_fns_equal.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/overload5.C: New test.
	* g++.dg/cpp0x/overload5a.C: New test.
	* g++.dg/cpp0x/overload6.C: New test.

(cherry picked from commit 3192466)
NinaRanns pushed a commit that referenced this pull request Mar 5, 2024
-Wmismatched-tags warns about the (harmless) struct/class mismatch.
For, e.g.,

  template<typename T> struct A { };
  class A<int> a;

it works by adding A<T> to the class2loc hash table while parsing the
class-head and then, while parsing the elaborate type-specifier, we
add A<int>.  At the end of c_parse_file we go through the table and
warn about the class-key mismatches.  In this PR we crash though; we
have

  template<typename T> struct A {
    template<typename U> struct W { };
  };
  struct A<int>::W<int> w; // #1

where while parsing A and #1 we've stashed
   A<T>
   A<T>::W<U>
   A<int>::W<int>
into class2loc.  Then in class_decl_loc_t::diag_mismatched_tags TYPE
is A<int>::W<int>, and specialization_of gets us A<int>::W<U>, which
is not in class2loc, so we crash on gcc_assert (cdlguide).  But it's
OK not to have found A<int>::W<U>, we should just look one "level" up,
that is, A<T>::W<U>.

It's important to handle class specializations, so e.g.

  template<>
  struct A<char> {
    template<typename U>
    class W { };
  };

where W's class-key is different than in the primary template above,
so we should warn depending on whether we're looking into A<char>
or into a different instantiation.

	PR c++/106259

gcc/cp/ChangeLog:

	* parser.cc (class_decl_loc_t::diag_mismatched_tags): If the first
	lookup of SPEC didn't find anything, try to look for
	most_general_template.

gcc/testsuite/ChangeLog:

	* g++.dg/warn/Wmismatched-tags-11.C: New test.

(cherry picked from commit 71afd06)
NinaRanns pushed a commit that referenced this pull request Mar 5, 2024
Hi all,

We noticed that calls to the vadcq and vsbcq intrinsics, both of
which use __builtin_arm_set_fpscr_nzcvqc to set the Carry flag in
the FPSCR, would produce the following code:

```
< r2 is the *carry input >
vmrs	r3, FPSCR_nzcvqc
bic	r3, r3, #536870912
orr	r3, r3, r2, lsl #29
vmsr	FPSCR_nzcvqc, r3
```

when the MVE ACLE instead gives a different instruction sequence of:
```
< Rt is the *carry input >
VMRS Rs,FPSCR_nzcvqc
BFI Rs,Rt,#29,#1
VMSR FPSCR_nzcvqc,Rs
```

The bic + orr pair is slower, and it's also wrong: if the *carry input
is greater than 1, we risk overwriting the top two bits of the FPSCR
register (the N and Z flags).

This turned out to be a problem in the header file, and the solution was
simply to add a `& 0x1u` to the `*carry` input: then the compiler knows
that we only care about the lowest bit and can optimise to a BFI.
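
In other words, the intrinsic bodies presumably went from something
like the first line below to the second (illustrative C, not the exact
header code):

  fpscr = (fpscr & ~0x20000000u) | (*carry << 29u);           /* bic + orr */
  fpscr = (fpscr & ~0x20000000u) | ((*carry & 0x1u) << 29u);  /* single bfi */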

Ok for trunk?

Thanks,
Stam Markianos-Wright

gcc/ChangeLog:

	* config/arm/arm_mve.h (__arm_vadcq_s32): Fix arithmetic.
	(__arm_vadcq_u32): Likewise.
	(__arm_vadcq_m_s32): Likewise.
	(__arm_vadcq_m_u32): Likewise.
	(__arm_vsbcq_s32): Likewise.
	(__arm_vsbcq_u32): Likewise.
	(__arm_vsbcq_m_s32): Likewise.
	(__arm_vsbcq_m_u32): Likewise.
	* config/arm/mve.md (get_fpscr_nzcvqc): Make unspec_volatile.

gcc/testsuite/ChangeLog:
	* gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c: New.

(cherry picked from commit f1417d051be094ffbce228e11951f3e12e8fca1c)
NinaRanns pushed a commit that referenced this pull request May 16, 2024
We evaluate constexpr functions on the original, pre-genericization bodies.
That means that the function body we're evaluating will not have gone
through cp_genericize_r's "Map block scope extern declarations to visible
declarations with the same name and type in outer scopes if any".  Here:

  constexpr bool bar() { return true; } // #1
  constexpr bool foo() {
    constexpr bool bar(void); // #2
    return bar();
  }

it means that we:
1) register_constexpr_fundef (#1)
2) cp_genericize (#1)
   nothing interesting happens
3) register_constexpr_fundef (foo)
   does copy_fn, so we have two copies of the BIND_EXPR
4) cp_genericize (foo)
   this remaps #2 to #1, but only on one copy of the BIND_EXPR
5) retrieve_constexpr_fundef (foo)
   we find it, no problem
6) retrieve_constexpr_fundef (#2)
   and here #2 isn't found in constexpr_fundef_table, because
   we're working on the BIND_EXPR copy where #2 wasn't mapped to #1
   so we fail.  We've only registered #1.

It should work to use DECL_LOCAL_DECL_ALIAS (which used to be
extern_decl_map).  We evaluate constexpr functions on pre-cp_fold
bodies to avoid diagnostic problems, but the remapping I'm proposing
should not interfere with diagnostics.

This is not a problem for a global scope redeclaration; there we go
through duplicate_decls which keeps the DECL_UID:
  DECL_UID (olddecl) = olddecl_uid;
and DECL_UID is what constexpr_fundef_hasher::hash uses.
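
A sketch of the cvt.cc change (the exact guards may differ):

  /* In cp_get_fndecl_from_callee: a block-scope extern redeclaration
     carries an alias to the visible declaration, whose DECL_UID is the
     one registered in constexpr_fundef_table.  */
  if (fndecl
      && TREE_CODE (fndecl) == FUNCTION_DECL
      && DECL_LOCAL_DECL_P (fndecl)
      && DECL_LOCAL_DECL_ALIAS (fndecl))
    fndecl = DECL_LOCAL_DECL_ALIAS (fndecl);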

	PR c++/111132

gcc/cp/ChangeLog:

	* constexpr.cc (get_function_named_in_call): Use
	cp_get_fndecl_from_callee.
	* cvt.cc (cp_get_fndecl_from_callee): If there's a
	DECL_LOCAL_DECL_ALIAS, use it.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/constexpr-redeclaration3.C: New test.
	* g++.dg/cpp0x/constexpr-redeclaration4.C: New test.
NinaRanns pushed a commit that referenced this pull request May 16, 2024
aarch64-sve.md had a pattern that combined:

	cmpeq	pb.T, pa/z, zc.T, #0
	mov	zd.T, pb/z, #1

into:

	cnot	zd.T, pa/m, zc.T

But this is only valid if pa.T is a ptrue.  In other cases, the
original would set inactive elements of zd.T to 0, whereas the
combined form would copy elements from zc.T.
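
For example (illustrative ACLE use; assumes an SVE-enabled target), an
_x operation whose governing predicate is a ptrue can still use CNOT:

  #include <arm_sve.h>

  svint32_t
  f (svint32_t x)
  {
    return svcnot_x (svptrue_b32 (), x);  /* a single cnot */
  }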

gcc/
	PR target/114603
	* config/aarch64/aarch64-sve.md (@aarch64_pred_cnot<mode>): Replace
	with...
	(@aarch64_ptrue_cnot<mode>): ...this, requiring operand 1 to be
	a ptrue.
	(*cnot<mode>): Require operand 1 to be a ptrue.
	* config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand):
	Use aarch64_ptrue_cnot<mode> for _x operations that are predicated
	with a ptrue.  Represent other _x operations as fully-defined _m
	operations.

gcc/testsuite/
	PR target/114603
	* gcc.target/aarch64/sve/acle/general/cnot_1.c: New test.
NinaRanns pushed a commit that referenced this pull request May 16, 2024
…. [PR114741]

In PR114741 we see that we have a regression in codegen when SVE is enable where
the simple testcase:

void foo(unsigned v, unsigned *p)
{
    *p = v & 1;
}

generates

foo:
        fmov    s31, w0
        and     z31.s, z31.s, #1
        str     s31, [x1]
        ret

instead of:

foo:
        and     w0, w0, 1
        str     w0, [x1]
        ret

This has an impact on not just code size but also performance.  It is caused
by the use of the ^ constraint modifier in the pattern <optab><mode>3.

The documentation states that this modifier should only have an effect on the
alternative costing in that a particular alternative is to be preferred unless
a non-pseudo reload is needed.

The pattern was trying to convey that whenever both r and w are required, that
it should prefer r unless a reload is needed.  This is because if a reload is
needed then we can construct the constants more flexibly on the SIMD side.

We were using this to simplify the implementation and to get generic cases such
as:

double negabs (double x)
{
   unsigned long long y;
   memcpy (&y, &x, sizeof(double));
   y = y | (1UL << 63);
   memcpy (&x, &y, sizeof(double));
   return x;
}

which don't go through an expander.
However the implementation of ^ in the register allocator is not according to
the documentation in that it also has an effect during coloring.  During initial
register class selection it applies a penalty to a class, similar to how ? does.

In this example the penalty makes the use of GP regs expensive enough that it no
longer considers them:

    r106: preferred FP_REGS, alternative NO_REGS, allocno FP_REGS
;;        3--> b  0: i   9 r106=r105&0x1
    :cortex_a53_slot_any:GENERAL_REGS+0(-1)FP_REGS+1(1)PR_LO_REGS+0(0)
                         PR_HI_REGS+0(0):model 4

which is not the expected behavior.  For GCC 14 this is a conservative fix.

1. we remove the ^ modifier from the logical optabs.

2. In order not to regress copysign we then move the copysign expansion to
   directly use the SIMD variant.  Since copysign only supports floating point
   modes this is fine and no longer relies on the register allocator to select
   the right alternative.

It once again regresses the general case, but this case wasn't optimized in
earlier GCCs either so it's not a regression in GCC 14.  This change gives
strictly better codegen than earlier GCCs and still optimizes the important cases.

gcc/ChangeLog:

	PR target/114741
	* config/aarch64/aarch64.md (<optab><mode>3): Remove ^ from alt 2.
	(copysign<GPF:mode>3): Use SIMD version of IOR directly.

gcc/testsuite/ChangeLog:

	PR target/114741
	* gcc.target/aarch64/fneg-abs_2.c: Update codegen.
	* gcc.target/aarch64/fneg-abs_4.c: xfail for now.
	* gcc.target/aarch64/pr114741.c: New test.
NinaRanns pushed a commit that referenced this pull request May 16, 2024
The PR complains that

  void do_something(){
    #pragma GCC diagnostic push
    #pragma GCC diagnostic ignored "-Wunused-label"
    start:;
    #pragma GCC diagnostic pop
  } #1

doesn't work.  That's because we warn_for_unused_label only while we're
in finish_function, meaning we're at #1 where we're outside the #pragma
region.  We can use suppress_warning + warning_suppressed_p to fix this.

Note that I'm not using TREE_USED.  Propagating it in tsubst_stmt/LABEL_EXPR
from decl to label would mean that we don't warn in do_something2, but
I think we want the warning there: we're in a template and the goto is
a discarded statement.
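
A sketch of the c-warn.cc guard (illustrative):

  /* In warn_for_unused_label: the #pragma region recorded the
     suppression on the label via suppress_warning.  */
  if (warning_suppressed_p (label, OPT_Wunused_label))
    return;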

	PR c++/113582

gcc/c-family/ChangeLog:

	* c-warn.cc (warn_for_unused_label): Don't warn if -Wunused-label has
	been suppressed for the label.

gcc/cp/ChangeLog:

	* parser.cc (cp_parser_label_for_labeled_statement): suppress_warning
	if it's not enabled at input_location.
	* pt.cc (tsubst_stmt): Call copy_warning.

gcc/testsuite/ChangeLog:

	* g++.dg/warn/Wunused-label-4.C: New test.
NinaRanns pushed a commit that referenced this pull request May 16, 2024
This patch would like to fix below format issue of trailing operator.

=== ERROR type #1: trailing operator (4 error(s)) ===
gcc/config/riscv/riscv-vector-builtins.cc:4641:39:  if ((exts &
RVV_REQUIRE_ELEN_FP_16) &&
gcc/config/riscv/riscv-vector-builtins.cc:4651:39:  if ((exts &
RVV_REQUIRE_ELEN_FP_32) &&
gcc/config/riscv/riscv-vector-builtins.cc:4661:39:  if ((exts &
RVV_REQUIRE_ELEN_FP_64) &&
gcc/config/riscv/riscv-vector-builtins.cc:4670:36:  if ((exts &
RVV_REQUIRE_ELEN_64) &&

Passed ./contrib/check_GNU_style.sh for this patch, and double-checked
that there is no other format issue in the original patch.

Committed as format change.
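
For illustration, the change is of this shape (other_condition is a
placeholder):

  /* Before: trailing operator.  */
  if ((exts & RVV_REQUIRE_ELEN_FP_16) &&
      other_condition)

  /* After: the operator leads the continuation line.  */
  if ((exts & RVV_REQUIRE_ELEN_FP_16)
      && other_condition)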

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins.cc
	(validate_instance_type_required_extensions): Remove the
	operator from the trailing and put it to new line.

Signed-off-by: Pan Li <pan2.li@intel.com>
NinaRanns pushed a commit that referenced this pull request May 24, 2024
resolving an error while parsing an attribute like contract
NinaRanns pushed a commit that referenced this pull request Jul 9, 2024
Here during overload resolution we have two strictly viable ambiguous
candidates #1 and #2, and two non-strictly viable candidates #3 and #4
which we hold on to ever since r14-6522.  These latter candidates have
an empty second arg conversion since the first arg conversion was deemed
bad, and this trips up joust when called on #3 and #4 which assumes all
arg conversions are there.

We can fix this by making joust robust to empty arg conversions, but in
this situation we shouldn't need to compare #3 and #4 at all given that
we have a strictly viable candidate.  To that end, this patch makes
tourney shortcut considering non-strictly viable candidates upon
encountering ambiguity between two strictly viable candidates (taking
advantage of the fact that the candidates list is sorted according to
viability via splice_viable).
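
A hypothetical shape of the regression (illustrative, not the actual
error7.C testcase):

  void f(int, int);    // #1: strictly viable
  void f(long, long);  // #2: strictly viable, ambiguous with #1
  void f(int*, int);   // #3: bad first arg conversion, non-strictly viable
  void f(int*, long);  // #4: likewise

  int main() { f(1, 2L); }  // crashed before the fix; now plain ambiguity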

	PR c++/115239

gcc/cp/ChangeLog:

	* call.cc (tourney): Don't consider a non-strictly viable
	candidate as the champ if there was ambiguity between two
	strictly viable candidates.

gcc/testsuite/ChangeLog:

	* g++.dg/overload/error7.C: New test.

Reviewed-by: Jason Merrill <jason@redhat.com>
NinaRanns pushed a commit that referenced this pull request Jul 9, 2024
This patch improves GCC’s vectorization of __builtin_popcount for the aarch64 target
by adding popcount patterns for vector modes besides QImode, i.e., HImode,
SImode and DImode.

With this patch, we now generate the following for V8HI:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b

For V4HI, we generate:
  cnt     v1.8b, v0.8b
  uaddlp  v2.4h, v1.8b

For V4SI, we generate:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h

For V4SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b

For V2SI, we generate:
  cnt     v1.8b, v0.8b
  uaddlp  v2.4h, v1.8b
  uaddlp  v3.2s, v2.4h

For V2SI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.8b, #0
  movi    v1.8b, #1
  cnt     v3.8b, v2.8b
  udot    v0.2s, v3.8b, v1.8b

For V2DI, we generate:
  cnt     v1.16b, v0.16b
  uaddlp  v2.8h, v1.16b
  uaddlp  v3.4s, v2.8h
  uaddlp  v4.2d, v3.4s

For V2DI with TARGET_DOTPROD, we generate the following instead:
  movi    v0.4s, #0
  movi    v1.16b, #1
  cnt     v3.16b, v2.16b
  udot    v0.4s, v3.16b, v1.16b
  uaddlp  v0.2d, v0.4s
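
A loop of the following shape should now vectorize to the V4SI
sequences above (assuming -O3 on an aarch64 target):

  void
  popcnt_all (unsigned *restrict dst, unsigned *restrict src, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = __builtin_popcount (src[i]);
  }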

	PR target/113859

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md (aarch64_<su>addlp<mode>): Rename to...
	(@aarch64_<su>addlp<mode>): ... This.
	(popcount<mode>2): New define_expand.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/popcnt-udot.c: New test.
	* gcc.target/aarch64/popcnt-vec.c: New test.

Signed-off-by: Pengxuan Zheng <quic_pzheng@quicinc.com>
NinaRanns pushed a commit that referenced this pull request Sep 26, 2024
On many cores, including Neoverse V2 the throughput of vector ADD
instructions is higher than vector shifts like SHL.  We can lean on that
to emit code like:
  add     v0.4s, v0.4s, v0.4s
instead of:
  shl     v0.4s, v0.4s, 1

LLVM already does this trick.
In RTL the code gets canonicalised from (plus x x) to (ashift x 1) so I
opted to instead do this at the final assembly printing stage, similar
to how we emit CMLT instead of SSHR elsewhere in the backend.

I'd like to also do this for SVE shifts, but those will have to be
separate patches.
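
A minimal example of the kind of code affected (assumes -O2 on an
aarch64 target):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  double_elems (v4si x)
  {
    return x << 1;  /* now printed as add v0.4s, v0.4s, v0.4s */
  }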

Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(aarch64_simd_imm_shl<mode><vczle><vczbe>): Rewrite to new
	syntax.  Add =w,w,vs1 alternative.
	* config/aarch64/constraints.md (vs1): New constraint.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/advsimd_shl_add.c: New test.
NinaRanns pushed a commit that referenced this pull request Sep 26, 2024
This patch tweaks timode_scalar_chain::compute_convert_gain to better
reflect the expansion of V1TImode arithmetic right shifts by the i386
backend.  The comment "see ix86_expand_v1ti_ashiftrt" appears after
"case ASHIFTRT" in compute_convert_gain, and the changes below attempt
to better match the logic used there.

The original motivating example is:

__int128 m1;
void foo()
{
  m1 = (m1 << 8) >> 8;
}

which with -O2 -mavx2 we fail to convert to vector form due to the
inappropriate cost of the arithmetic right shift.

  Instruction gain -16 for     7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
  Total gain: -3
  Chain #1 conversion is not profitable

This is reporting that the ASHIFTRT is four instructions worse using
vectors than in scalar form, which is incorrect as the AVX2 expansion
of this shift only requires three instructions (and the scalar form
requires two).

With more accurate costs in timode_scalar_chain::compute_convert_gain
we now see (with -O2 -mavx2):

  Instruction gain -4 for     7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
  Total gain: 9
  Converting chain #1...

which results in:

foo:	vmovdqa m1(%rip), %xmm0
        vpslldq $1, %xmm0, %xmm0
        vpsrad  $8, %xmm0, %xmm1
        vpsrldq $1, %xmm0, %xmm0
        vpblendd        $7, %xmm0, %xmm1, %xmm0
        vmovdqa %xmm0, m1(%rip)
        ret

2024-08-25  Roger Sayle  <roger@nextmovesoftware.com>
	    Uros Bizjak  <ubizjak@gmail.com>

gcc/ChangeLog
	* config/i386/i386-features.cc (compute_convert_gain)
	<case ASHIFTRT>: Update to match ix86_expand_v1ti_ashiftrt.
NinaRanns pushed a commit that referenced this pull request Sep 26, 2024
…o_debug_section [PR116614]

cat abc.C
  #define A(n) struct T##n {} t##n;
  #define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
  #define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
  #define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
  #define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
  E(1) E(2) E(3)
  int main () { return 0; }
./xg++ -B ./ -o abc{.o,.C} -flto -flto-partition=1to1 -O2 -g -fdebug-types-section -c
./xgcc -B ./ -o abc{,.o} -flto -flto-partition=1to1 -O2
(not included in testsuite as it takes a while to compile) FAILs with
lto-wrapper: fatal error: Too many copied sections: Operation not supported
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

The following patch fixes that.  Most of the 64K+ section support for
reading and writing was already there years ago (and especially reading was
used quite often already), and a further bug in it was fixed by the PR104617 fix.

Yet, the fix isn't solely about removing the
  if (new_i - 1 >= SHN_LORESERVE)
    {
      *err = ENOTSUP;
      return "Too many copied sections";
    }
5 lines; the missing part was that the function only handled reading of
the .symtab_shndx section but not copying/updating of it.
If the result has less than 64K-epsilon sections, that actually wasn't
needed, but e.g. with -fdebug-types-section one can exceed that pretty
easily (reported to us on WebKitGtk build on ppc64le).
Updating the section is slightly more complicated, because it basically
needs to be done in lock step with updating the .symtab section: if one
doesn't need to use SHN_XINDEX for an entry, the section should (or should be
updated to) contain an SHN_UNDEF entry, otherwise it needs to hold whatever
would otherwise be stored there but couldn't fit.  But repeating all the
symtab decisions about what to discard and how to rewrite it would be ugly.

So, the patch instead emits the .symtab_shndx section (or sections) last
and prepares the content during the .symtab processing and in a second
pass when going just through .symtab_shndx sections just uses the saved
content.

2024-09-07  Jakub Jelinek  <jakub@redhat.com>

	PR lto/116614
	* simple-object-elf.c (SHN_COMMON): Align comment with neighbouring
	comments.
	(SHN_HIRESERVE): Use uppercase hex digits instead of lowercase for
	consistency.
	(simple_object_elf_find_sections): Formatting fixes.
	(simple_object_elf_fetch_attributes): Likewise.
	(simple_object_elf_attributes_merge): Likewise.
	(simple_object_elf_start_write): Likewise.
	(simple_object_elf_write_ehdr): Likewise.
	(simple_object_elf_write_shdr): Likewise.
	(simple_object_elf_write_to_file): Likewise.
	(simple_object_elf_copy_lto_debug_section): Likewise.  Don't fail for
	new_i - 1 >= SHN_LORESERVE, instead arrange in that case to copy
	over .symtab_shndx sections, though emit those last and compute their
	section content when processing associated .symtab sections.  Handle
	simple_object_internal_read failure even in the .symtab_shndx reading
	case.
NinaRanns pushed a commit that referenced this pull request Sep 26, 2024
On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift
instructions like SHL have a throughput of 2. We can lean on that to emit code
like:
 add	z31.b, z31.b, z31.b
instead of:
 lsl	z31.b, z31.b, #1

The implementation of this change for SVE vectors is similar to a prior patch
<https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659958.html> that adds
the above functionality for Neon vectors.

Here, the machine descriptor pattern is split up to separately accommodate left
and right shifts, so we can specifically emit an add for all left shifts by 1.
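
A minimal example of an affected shift (assumes -march=armv8.2-a+sve;
per the test updates below, the ACLE lsl-by-1 tests now expect an add):

  #include <arm_sve.h>

  svint8_t
  double_elems (svbool_t pg, svint8_t x)
  {
    return svlsl_n_s8_x (pg, x, 1);  /* add z0.b, z0.b, z0.b */
  }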

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Soumya AR <soumyaa@nvidia.com>

gcc/ChangeLog:

	* config/aarch64/aarch64-sve.md (*post_ra_v<optab><mode>3): Split pattern
	to accommodate left and right shifts separately.
	(*post_ra_v_ashl<mode>3): Matches left shifts with additional
	constraint to check for shifts by 1.
	(*post_ra_v_<optab><mode>3): Matches right shifts.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of lsl-1
	with corresponding add.
	* gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u64.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_u8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise.
	* gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise.
	* gcc.target/aarch64/sve/adr_1.c: Likewise.
	* gcc.target/aarch64/sve/adr_6.c: Likewise.
	* gcc.target/aarch64/sve/cond_mla_7.c: Likewise.
	* gcc.target/aarch64/sve/cond_mla_8.c: Likewise.
	* gcc.target/aarch64/sve/shift_2.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s16.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s32.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_s8.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u16.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u32.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/rshl_u8.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_u64.c: Likewise.
	* gcc.target/aarch64/sve/sve_shl_add.c: New test.
NinaRanns pushed a commit that referenced this pull request Sep 26, 2024
…on [PR113328]

SVE's INDEX instruction can be used to populate vectors by values starting from
"base" and incremented by "step" for each subsequent value. We can take
advantage of it to generate vector constants if TARGET_SVE is available and the
base and step values are within [-16, 15].

For example, with the following function:

typedef int v4si __attribute__ ((vector_size (16)));
v4si
f_v4si (void)
{
  return (v4si){ 0, 1, 2, 3 };
}

GCC currently generates:

f_v4si:
	adrp    x0, .LC4
	ldr     q0, [x0, #:lo12:.LC4]
	ret

.LC4:
	.word   0
	.word   1
	.word   2
	.word   3

With this patch, we generate an INDEX instruction instead if TARGET_SVE is
available.

f_v4si:
	index   z0.s, #0, #1
	ret

	PR target/113328

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_simd_valid_immediate): Improve
	handling of some ADVSIMD vectors by using SVE's INDEX if TARGET_SVE is
	available.
	(aarch64_output_simd_mov_immediate): Likewise.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/acle/general/dupq_1.c: Update test to use
	SVE's INDEX instruction.
	* gcc.target/aarch64/sve/acle/general/dupq_2.c: Likewise.
	* gcc.target/aarch64/sve/acle/general/dupq_3.c: Likewise.
	* gcc.target/aarch64/sve/acle/general/dupq_4.c: Likewise.
	* gcc.target/aarch64/sve/vec_init_3.c: New test.

Signed-off-by: Pengxuan Zheng <quic_pzheng@quicinc.com>
iains pushed a commit that referenced this pull request Nov 6, 2024
Implement vddup and vidup using the new MVE builtins framework.

We generate better code because we take advantage of the two outputs
produced by the v[id]dup instructions.

For instance, before:
	ldr	r3, [r0]
	sub	r2, r3, #8
	str	r2, [r0]
	mov	r2, r3
	vddup.u16	q3, r2, #1

now:
	ldr	r2, [r0]
	vddup.u16	q3, r2, #1
	str	r2, [r0]
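
For reference, the sequences above correspond to a _wb (writeback) use
such as (illustrative; assumes an MVE-enabled target):

  #include <arm_mve.h>

  uint16x8_t
  f (uint32_t *p)
  {
    return vddupq_wb_u16 (p, 1);  /* vector result plus updated base */
  }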

2024-08-21  Christophe Lyon  <christophe.lyon@linaro.org>

	gcc/
	* config/arm/arm-mve-builtins-base.cc (class viddup_impl): New.
	(vddup): New.
	(vidup): New.
	* config/arm/arm-mve-builtins-base.def (vddupq): New.
	(vidupq): New.
	* config/arm/arm-mve-builtins-base.h (vddupq): New.
	(vidupq): New.
	* config/arm/arm_mve.h (vddupq_m): Delete.
	(vddupq_u8): Delete.
	(vddupq_u32): Delete.
	(vddupq_u16): Delete.
	(vidupq_m): Delete.
	(vidupq_u8): Delete.
	(vidupq_u32): Delete.
	(vidupq_u16): Delete.
	(vddupq_x_u8): Delete.
	(vddupq_x_u16): Delete.
	(vddupq_x_u32): Delete.
	(vidupq_x_u8): Delete.
	(vidupq_x_u16): Delete.
	(vidupq_x_u32): Delete.
	(vddupq_m_n_u8): Delete.
	(vddupq_m_n_u32): Delete.
	(vddupq_m_n_u16): Delete.
	(vddupq_m_wb_u8): Delete.
	(vddupq_m_wb_u16): Delete.
	(vddupq_m_wb_u32): Delete.
	(vddupq_n_u8): Delete.
	(vddupq_n_u32): Delete.
	(vddupq_n_u16): Delete.
	(vddupq_wb_u8): Delete.
	(vddupq_wb_u16): Delete.
	(vddupq_wb_u32): Delete.
	(vidupq_m_n_u8): Delete.
	(vidupq_m_n_u32): Delete.
	(vidupq_m_n_u16): Delete.
	(vidupq_m_wb_u8): Delete.
	(vidupq_m_wb_u16): Delete.
	(vidupq_m_wb_u32): Delete.
	(vidupq_n_u8): Delete.
	(vidupq_n_u32): Delete.
	(vidupq_n_u16): Delete.
	(vidupq_wb_u8): Delete.
	(vidupq_wb_u16): Delete.
	(vidupq_wb_u32): Delete.
	(vddupq_x_n_u8): Delete.
	(vddupq_x_n_u16): Delete.
	(vddupq_x_n_u32): Delete.
	(vddupq_x_wb_u8): Delete.
	(vddupq_x_wb_u16): Delete.
	(vddupq_x_wb_u32): Delete.
	(vidupq_x_n_u8): Delete.
	(vidupq_x_n_u16): Delete.
	(vidupq_x_n_u32): Delete.
	(vidupq_x_wb_u8): Delete.
	(vidupq_x_wb_u16): Delete.
	(vidupq_x_wb_u32): Delete.
	(__arm_vddupq_m_n_u8): Delete.
	(__arm_vddupq_m_n_u32): Delete.
	(__arm_vddupq_m_n_u16): Delete.
	(__arm_vddupq_m_wb_u8): Delete.
	(__arm_vddupq_m_wb_u16): Delete.
	(__arm_vddupq_m_wb_u32): Delete.
	(__arm_vddupq_n_u8): Delete.
	(__arm_vddupq_n_u32): Delete.
	(__arm_vddupq_n_u16): Delete.
	(__arm_vidupq_m_n_u8): Delete.
	(__arm_vidupq_m_n_u32): Delete.
	(__arm_vidupq_m_n_u16): Delete.
	(__arm_vidupq_n_u8): Delete.
	(__arm_vidupq_m_wb_u8): Delete.
	(__arm_vidupq_m_wb_u16): Delete.
	(__arm_vidupq_m_wb_u32): Delete.
	(__arm_vidupq_n_u32): Delete.
	(__arm_vidupq_n_u16): Delete.
	(__arm_vidupq_wb_u8): Delete.
	(__arm_vidupq_wb_u16): Delete.
	(__arm_vidupq_wb_u32): Delete.
	(__arm_vddupq_wb_u8): Delete.
	(__arm_vddupq_wb_u16): Delete.
	(__arm_vddupq_wb_u32): Delete.
	(__arm_vddupq_x_n_u8): Delete.
	(__arm_vddupq_x_n_u16): Delete.
	(__arm_vddupq_x_n_u32): Delete.
	(__arm_vddupq_x_wb_u8): Delete.
	(__arm_vddupq_x_wb_u16): Delete.
	(__arm_vddupq_x_wb_u32): Delete.
	(__arm_vidupq_x_n_u8): Delete.
	(__arm_vidupq_x_n_u16): Delete.
	(__arm_vidupq_x_n_u32): Delete.
	(__arm_vidupq_x_wb_u8): Delete.
	(__arm_vidupq_x_wb_u16): Delete.
	(__arm_vidupq_x_wb_u32): Delete.
	(__arm_vddupq_m): Delete.
	(__arm_vddupq_u8): Delete.
	(__arm_vddupq_u32): Delete.
	(__arm_vddupq_u16): Delete.
	(__arm_vidupq_m): Delete.
	(__arm_vidupq_u8): Delete.
	(__arm_vidupq_u32): Delete.
	(__arm_vidupq_u16): Delete.
	(__arm_vddupq_x_u8): Delete.
	(__arm_vddupq_x_u16): Delete.
	(__arm_vddupq_x_u32): Delete.
	(__arm_vidupq_x_u8): Delete.
	(__arm_vidupq_x_u16): Delete.
	(__arm_vidupq_x_u32): Delete.
iains pushed a commit that referenced this pull request Nov 6, 2024
gcc.dg/torture/pr112305.c contains an inner loop that executes
0x8000_0014 times and an outer loop that executes 5 times, giving about
10 billion total executions of the inner loop body.  At -O2 and above we
are able to remove the inner loop, but at -O1 we keep a no-op loop:

        dls     lr, r3
.L3:
        subs    r3, r3, #1
        le      lr, .L3

and at -O0 we of course don't optimise.

This can lead to long execution times on simulators, possibly
triggering a timeout.

gcc/testsuite
	* gcc.dg/torture/pr112305.c: Skip at -O0 and -O1 for simulators.
iains pushed a commit that referenced this pull request Nov 6, 2024
We currently crash upon the following invalid code (notice the "void
void**" parameter)

=== cut here ===
using size_t = decltype(sizeof(int));
void *operator new(size_t, void void **p) noexcept { return p; }
int x;
void f() {
    int y;
    new (&y) int(x);
}
=== cut here ===

The problem is that in this case, we end up with a NULL_TREE parameter
list for the new operator because of the error, and (1) coerce_new_type
wrongly complains about the first parameter type not being size_t,
(2) std_placement_new_fn_p blindly accesses the parameter list, hence a
crash.

This patch does NOT address #1 since we can't easily distinguish
a new operator declaration without parameters from one with erroneous
parameters (and it's not worth the risk to refactor and break things for
an error recovery issue) hence a dg-bogus in new52.C, but it does
address #2 and the ICE by simply checking the first parameter against
NULL_TREE.
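
That is, roughly (sketch):

  /* In std_placement_new_fn_p: an erroneous operator new may have no
     parameter list at all.  */
  if (first_arg == NULL_TREE)
    return false;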

It also adds a new testcase checking that we complain about new
operators with no or invalid first parameters, since we did not have
any.

	PR c++/117101

gcc/cp/ChangeLog:

	* init.cc (std_placement_new_fn_p): Check first_arg against
	NULL_TREE.

gcc/testsuite/ChangeLog:

	* g++.dg/init/new52.C: New test.
	* g++.dg/init/new53.C: New test.
NinaRanns pushed a commit that referenced this pull request Nov 26, 2024
Update test case for armv8.1-m.main that supports conditional
arithmetic.

armv7-m:
        push    {r4, lr}
        ldr     r4, .L6
        ldr     r4, [r4]
        lsls    r4, r4, #29
        it      mi
        addmi   r2, r2, #1
        bl      bar
        movs    r0, #0
        pop     {r4, pc}

armv8.1-m.main:
        push    {r3, r4, r5, lr}
        ldr     r4, .L5
        ldr     r5, [r4]
        tst     r5, #4
        csinc   r2, r2, r2, eq
        bl      bar
        movs    r0, #0
        pop     {r3, r4, r5, pc}

gcc/testsuite/ChangeLog:

	* gcc.target/arm/epilog-1.c: Use check-function-bodies.

Signed-off-by: Torbjörn SVENSSON <torbjorn.svensson@foss.st.com>
NinaRanns pushed a commit that referenced this pull request Nov 26, 2024
The second source register of insn "*extzvsi-1bit_addsubx" cannot be the
same as the destination register, because that register will be overwritten
with an intermediate value after insn splitting.

     /* example #1 */
     int test1(int b, int a) {
       return ((a & 1024) ? 4 : 0) + b;
     }

     ;; result #1 (incorrect)
     test1:
     	extui	a2, a3, 10, 1	;; overwrites A2 before used
     	addx4	a2, a2, a2
     	ret.n

This patch fixes that.

     ;; result #1 (correct)
     test1:
     	extui	a3, a3, 10, 1	;; uses A3 and then overwrites
     	addx4	a2, a3, a2
     	ret.n

However, it should be noted that the first source register can be the same
as the destination without any problems.

     /* example #2 */
     int test2(int a, int b) {
       return ((a & 1024) ? 4 : 0) + b;
     }

     ;; result (correct)
     test2:
     	extui	a2, a2, 10, 1	;; uses A2 and then overwrites
     	addx4	a2, a2, a3
     	ret.n

gcc/ChangeLog:

	* config/xtensa/xtensa.md (*extzvsi-1bit_addsubx):
	Add '&' to the destination register constraint to indicate that
	it is 'earlyclobber', append '0' to the first source register
	constraint to indicate that it can be the same as the destination
	register, and change the split condition from 1 to reload_completed
	so that the insn will be split only after RA in order to obtain
	allocated registers that satisfy the above constraints.
NinaRanns pushed a commit that referenced this pull request Jan 20, 2025
vec.h has this method:

  template<typename T, typename A>
  inline T *
  vec_safe_push (vec<T, A, vl_embed> *&v, const T &obj CXX_MEM_STAT_INFO)

where v is a reference to a pointer to vec.  This matches the regex for
VecPrinter, so gdbhooks.py attempts to print it but chokes on the reference.
I see the following:

  #1  0x0000000002b84b7b in vec_safe_push<edge_def*, va_gc> (v=Traceback (most
  recent call last):
    File "$SRC/gcc/gcc/gdbhooks.py", line 486, in to_string
      return '0x%x' % intptr(self.gdbval)
    File "$SRC/gcc/gcc/gdbhooks.py", line 168, in intptr
      return long(gdbval) if sys.version_info.major == 2 else int(gdbval)
  gdb.error: Cannot convert value to long.

This patch makes VecPrinter handle such references by stripping them
(dereferencing) at the top of the relevant functions.

gcc/ChangeLog:

	* gdbhooks.py (strip_ref): New. Use it ...
	(VecPrinter.to_string): ... here,
	(VecPrinter.children): ... and here.
NinaRanns pushed a commit that referenced this pull request Jan 20, 2025
Brief:
The bug appears in LRA after rematerialization pass while creating live ranges.
File lra.cc:
*************************************************************
      /* Now we know what pseudos should be spilled.  Try to
	 rematerialize them first.  */
      if (lra_remat ())
	{
	  /* We need full live info -- see the comment above.  */
	  lra_create_live_ranges (lra_reg_spill_p, true);
*************************************************************
The call `lra_create_live_ranges (lra_reg_spill_p, true)' is wrong.
It has to be `lra_create_live_ranges (true, true)'.

The explanation:
**********************************
int main (void)
{
  if (a.u33 * a.u33 != 0)
------^^^^^^^^^^^^^
    goto abrt;
  if (a.u33 * a.u40 * a.u33 != 0)
**********************************
The bug appears here.

Part of the expression `a.u33 * a.u33'
Before LRA:
*************************************************************
(insn 13 11 15 2 (set (reg:QI 184 [ _1+3 ])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 3 [0x3]))) [1 a+3 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 15 13 16 2 (set (reg:QI 64 [ a+4 ])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 4 [0x4]))) [1 a+4 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 16 15 20 2 (set (reg:QI 185 [ _1+4 ])
        (zero_extract:QI (reg:QI 64 [ a+4 ])
            (const_int 1 [0x1])
            (const_int 0 [0]))) "bf.c":11:8 985 {*extzvqi_split}
     (nil))
*************************************************************

After LRA:
*************************************************************
(insn 587 11 13 2 (set (reg:QI 24 r24 [368])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 3 [0x3]))) [1 a+3 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 13 587 15 2 (set (mem/c:QI (plus:HI (reg/f:HI 28 r28)
                (const_int 1 [0x1])) [4 %sfp+1 S1 A8])
        (reg:QI 24 r24 [368])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 15 13 16 2 (set (reg:QI 6 r6 [orig:64 a+4 ] [64])
        (mem/c:QI (const:HI (plus:HI (symbol_ref:HI ("a") [flags 0x2]  <var_decl 0x7c866435d000 a>)
                    (const_int 4 [0x4]))) [1 a+4 S1 A8])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
(insn 16 15 572 2 (set (reg:QI 24 r24 [orig:185 _1+4 ] [185])
        (zero_extract:QI (reg:QI 6 r6 [orig:64 a+4 ] [64])
            (const_int 1 [0x1])
            (const_int 0 [0]))) "bf.c":11:8 985 {*extzvqi_split}
     (nil))
(insn 572 16 20 2 (set (mem/c:QI (plus:HI (reg/f:HI 28 r28)
                (const_int 1 [0x1])) [4 %sfp+1 S1 A8])
        (reg:QI 24 r24 [orig:185 _1+4 ] [185])) "bf.c":11:8 86 {movqi_insn_split}
     (nil))
*************************************************************
Insn 13 and insn 572 use sfp+1 as a spill slot, but in the IRA pass these were
two different pseudos, r184 and r185.
Insn 13 uses sfp+1 as a spill slot for r184;
insn 572 uses the same slot for r185.  That's wrong.

Here we have a rematerialization.

Fragment from bf.c.317r.reload:
**************************************************************************************
******** Rematerialization #1: ********

df_worklist_dataflow_doublequeue: n_basic_blocks 14 n_edges 18 count 14 (    1)
df_worklist_dataflow_doublequeue: n_basic_blocks 14 n_edges 18 count 14 (    1)

Cands:
0 (nop=0, remat_regno=185, reload_regno=359):
(insn 16 15 572 2 (set (reg:QI 359 [orig:185 _1+4 ] [185])
                    (zero_extract:QI (reg:QI 64 [ a+4 ])
                        (const_int 1 [0x1])
                        (const_int 0 [0]))) "bf.c":11:8 985 {*extzvqi_split}
                 (nil))

**************************************************************************************
[...]
**************************************************************************************
Ranges after the compression:
 r185: [0..1]
	   Frame pointer can not be eliminated anymore
	   Spilling non-eliminable hard regs: 28 29
	 Spilling r113(28)
	 Spilling r184(29)
	 Spilling r208(29)
	 Spilling r209(28)
  Slot 0 regnos (width = 0):	 185	 209	 208	 184	 113
**************************************************************************************

The bug is here: `r185: [0..1]' is a wrong live range after compression.
r185 and r184 can't have the same spill slot!

Rematerialization in bf.c.317r.reload looks like:
*************************************************************
   24: r14:QI=r185:QI
    Inserting rematerialization insn before:
  581: r14:QI=zero_extract(r64:QI,0x1,0)

deleting insn with uid = 24.
         Considering alt=0 of insn 16:   (0) =r  (1) rYil  (2) n
          overall=0,losers=0,rld_nregs=0
   32: r22:QI=r185:QI
    Inserting rematerialization insn before:
  582: r22:QI=zero_extract(r64:QI,0x1,0)

deleting insn with uid = 32.
*************************************************************

This happened because:

Fragment from lra.c (lra):
*************************************************************************
      if (! live_p)
	{
	  /* We need full live info for spilling pseudos into
	     registers instead of memory.  */
	  lra_create_live_ranges (lra_reg_spill_p, true);
	  live_p = true;
	}
      /* We should check necessity for spilling here as the above live
	 range pass can remove spilled pseudos.  */
      if (! lra_need_for_spills_p ())
	break;
      /* Now we know what pseudos should be spilled.  Try to
	 rematerialize them first.  */
      if (lra_remat ())
	{
	  /* We need full live info -- see the comment above.  */
	  lra_create_live_ranges (lra_reg_spill_p, true);
----------------------------------^^^^^^^^^^^^^^^
	  live_p = true;
*************************************************************************

The bug is here.
Rematerialization can sometimes act like spilling pseudos into registers:
  582: r22:QI=zero_extract(r64:QI,0x1,0)

So here we need live ranges for all pseudos.
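
So the fix is presumably just:

  if (lra_remat ())
    {
      /* Rematerialization can act like spilling pseudos into registers,
         so build live ranges for all pseudos, not just spilled ones.  */
      lra_create_live_ranges (true, true);
      live_p = true;
    }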

PS: the patch will not affect any target with a usable definition of
    the TARGET_SPILL_CLASS hook.

	PR target/116778
gcc/
	* lra-lives.cc (complete_info_p): Clarification of the comment.
	* lra.cc (lra): Create a full live info after rematerialization.
NinaRanns pushed a commit that referenced this pull request Jan 20, 2025
This PR reports a missed optimization.  When we have:

  Str str{"Test"};
  callback(str);

as in the test, we're able to evaluate the Str::Str() call at compile
time.  But when we have:

  callback(Str{"Test"});

we are not.  With this patch (in fact, it's Patrick's patch with a little
tweak), we turn

  callback (TARGET_EXPR <D.2890, <<< Unknown tree: aggr_init_expr
    5
    __ct_comp
    D.2890
    (struct Str *) <<< Unknown tree: void_cst >>>
    (const char *) "Test" >>>>)

into

  callback (TARGET_EXPR <D.2890, {.str=(const char *) "Test", .length=4}>)

I explored the idea of calling maybe_constant_value for the whole
TARGET_EXPR in cp_fold.  That has three problems:
- we can't always elide a TARGET_EXPR, so we'd have to make sure the
  result is also a TARGET_EXPR;
- the resulting TARGET_EXPR must have the same flags, otherwise Bad
  Things happen;
- getting a new slot is also problematic.  I've seen a test where we
  had "TARGET_EXPR<D.2680, ...>, D.2680", and folding the whole TARGET_EXPR
  would get us "TARGET_EXPR<D.2681, ...>", but since we don't see the outer
  D.2680, we can't replace it with D.2681, and things break.

With this patch, two tree-ssa tests regressed: pr78687.C and pr90883.C.

FAIL: g++.dg/tree-ssa/pr90883.C   scan-tree-dump dse1 "Deleted redundant store: .*.a = {}"
is easy.  Previously, we would call C::C, so .gimple has:

  D.2590 = {};
  C::C (&D.2590);
  D.2597 = D.2590;
  return D.2597;

Then .einline inlines the C::C call:

  D.2590 = {};
  D.2590.a = {}; // #1
  D.2590.b = 0;  // #2
  D.2597 = D.2590;
  D.2590 ={v} {CLOBBER(eos)};
  return D.2597;

then #2 is removed in .fre1, and #1 is removed in .dse1.  So the test
passes.  But with the patch, .gimple won't have that C::C call, so the
IL is of course going to look different.  The .optimized dump looks the
same though so there's no problem.

pr78687.C is XFAILed because the test passes with r15-5746 but not with
r15-5747 as well.  I opened <https://gcc.gnu.org/PR117971>.

	PR c++/116416

gcc/cp/ChangeLog:

	* cp-gimplify.cc (cp_fold_r) <case TARGET_EXPR>: Try to fold
	TARGET_EXPR_INITIAL and replace it with the folded result if
	it's TREE_CONSTANT.

gcc/testsuite/ChangeLog:

	* g++.dg/analyzer/pr97116.C: Adjust dg-message.
	* g++.dg/tree-ssa/pr78687.C: Add XFAIL.
	* g++.dg/tree-ssa/pr90883.C: Adjust dg-final.
	* g++.dg/cpp0x/constexpr-prvalue1.C: New test.
	* g++.dg/cpp1y/constexpr-prvalue1.C: New test.

Co-authored-by: Patrick Palka <ppalka@redhat.com>
Reviewed-by: Jason Merrill <jason@redhat.com>
NinaRanns pushed a commit that referenced this pull request Jan 20, 2025
With the changes in r15-1579-g792f97b44ff, the code used as "padding" in
the test case is optimized away. Prevent this optimization by forcing a
read of the volatile memory.
Also, validate that there is a far jump in the generated assembler.

Without this patch, the generated assembler is reduced to:
f3:
        cmp     r0, #0
        beq     .L1
        ldr     r4, .L6
.L1:
        bx      lr
.L7:
        .align  2
.L6:
        .word   g_0_1

With the patch, the generated assembler is:
f3:
        movs    r2, #1
        ldr     r3, .L6
        push    {lr}
        str     r2, [r3]
        cmp     r0, #0
        bne     .LCB10
        bl      .L1     @far jump
.LCB10:
        b       .L7
.L8:
        .align  2
.L6:
        .word   .LANCHOR0
.L7:
        str     r2, [r3]
        ...
        str     r2, [r3]
.L1:
        pop     {pc}

gcc/testsuite/ChangeLog:

	* gcc.target/arm/thumb1-far-jump-2.c: Write to volatile memory
	in macro to avoid optimization.

Signed-off-by: Torbjörn SVENSSON <torbjorn.svensson@foss.st.com>
NinaRanns pushed a commit that referenced this pull request Jan 20, 2025
On Cortex-M4, the code generated is:
     cmp     r0, r1
     itte    ne
     lslne   r0, r0, r1
     asrne   r0, r0, #1
     moveq   r0, r1
     add     r0, r0, r1
     bx      lr

On Cortex-M7, the code generated is:
     cmp     r0, r1
     beq     .L3
     lsls    r0, r0, r1
     asrs    r0, r0, #1
     add     r0, r0, r1
     bx      lr
.L3:
     mov     r0, r1
     add     r0, r0, r1
     bx      lr

As Cortex-M7 only allow maximum one conditional instruction, force
Cortex-M4 to have a stable test case.

gcc/testsuite/ChangeLog:

	* gcc.target/arm/thumb-ifcvt.c: Use -mtune=cortex-m4.

Signed-off-by: Torbjörn SVENSSON <torbjorn.svensson@foss.st.com>
NinaRanns pushed a commit that referenced this pull request Jan 20, 2025
This crash started with my r12-7803 but I believe the problem lies
elsewhere.

build_vec_init has cleanup_flags whose purpose is -- if I grok this
correctly -- to avoid destructing an object multiple times.  Let's
say we are initializing an array of A.  Then we might end up in
a scenario similar to initlist-eh1.C:

  try
    {
      call A::A in a loop
      // #0
      try
        {
	  call a fn using the array
	}
      finally
	{
	  // #1
	  call A::~A in a loop
	}
    }
  catch
    {
      // #2
      call A::~A in a loop
    }

cleanup_flags makes us emit a statement like

  D.3048 = 2;

at #0 to disable performing the cleanup at #2, since #1 will take
care of the destruction of the array.

But if we are not emitting the loop because we can use a constant
initializer (and use a single { a, b, ...}), we shouldn't generate
the statement resetting the iterator to its initial value.  Otherwise
we crash in gimplify_var_or_parm_decl because it gets the stray decl
D.3048.

	PR c++/117985

gcc/cp/ChangeLog:

	* init.cc (build_vec_init): Pop CLEANUP_FLAGS if we're not
	generating the loop.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/initlist-array23.C: New test.
	* g++.dg/cpp0x/initlist-array24.C: New test.
NinaRanns pushed a commit that referenced this pull request Jan 20, 2025
This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
default. To that end, the function aarch64_use_new_vector_costs_p and its uses
were removed. To prevent costing vec_to_scalar operations with 0, as
described in
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
we adjusted vectorizable_store such that the variable n_adjacent_stores
also covers vec_to_scalar operations. This way vec_to_scalar operations
are not costed individually, but as a group.
As suggested by Richard Sandiford, the "known_ne" in the multi-lane check
was replaced by "maybe_ne" in order to treat nunits==1+1X as a vector
rather than a scalar.

Two tests were adjusted due to changes in codegen. In both cases, the
old code performed loop unrolling once, but the new code does not:
Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
        cbz     w3, .L92
        mov     x4, 0
        uxtw    x3, w3
+       cntd    x5
+       whilelo p7.d, xzr, x3
+       mov     z29.s, w5
        mov     z31.s, w2
-       whilelo p6.d, xzr, x3
-       mov     x2, x3
-       index   z30.s, #0, #1
-       uqdecd  x2
-       ptrue   p5.b, all
-       whilelo p7.d, xzr, x2
+       index   z30.d, #0, #1
+       ptrue   p6.b, all
        .p2align 3,,7
 .L94:
-       ld1d    z27.d, p7/z, [x0, #1, mul vl]
-       ld1d    z28.d, p6/z, [x0]
-       movprfx z29, z31
-       mul     z29.s, p5/m, z29.s, z30.s
-       incw    x4
-       uunpklo z0.d, z29.s
-       uunpkhi z29.d, z29.s
-       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
-       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
-       add     z25.d, z28.d, z25.d
+       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
+       movprfx z28, z31
+       mul     z28.s, p6/m, z28.s, z30.s
+       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
        add     z26.d, z27.d, z26.d
-       st1d    z26.d, p7, [x0, #1, mul vl]
-       whilelo p7.d, x4, x2
-       st1d    z25.d, p6, [x0]
-       incw    z30.s
-       incb    x0, all, mul #2
-       whilelo p6.d, x4, x3
+       st1d    z26.d, p7, [x0, x4, lsl 3]
+       add     z30.s, z30.s, z29.s
+       incd    x4
+       whilelo p7.d, x4, x3
        b.any   .L94
 .L92:
        ret

Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
        cbz     w3, .L84
-       addvl   x5, x1, #1
        mov     x4, 0
        uxtw    x3, w3
-       mov     z31.s, w2
+       cntd    x5
        whilelo p7.d, xzr, x3
-       mov     x2, x3
-       index   z30.s, #0, #1
-       uqdecd  x2
-       ptrue   p5.b, all
-       whilelo p6.d, xzr, x2
+       mov     z29.s, w5
+       mov     z31.s, w2
+       index   z30.d, #0, #1
+       ptrue   p6.b, all
        .p2align 3,,7
 .L86:
-       ld1d    z28.d, p7/z, [x1, x4, lsl 3]
-       ld1d    z27.d, p6/z, [x5, x4, lsl 3]
-       movprfx z29, z30
-       mul     z29.s, p5/m, z29.s, z31.s
-       add     z28.d, z28.d, #1
-       uunpklo z26.d, z29.s
-       st1d    z28.d, p7, [x0, z26.d, lsl 3]
-       incw    x4
-       uunpkhi z29.d, z29.s
+       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
+       movprfx z28, z30
+       mul     z28.s, p6/m, z28.s, z31.s
        add     z27.d, z27.d, #1
-       whilelo p6.d, x4, x2
-       st1d    z27.d, p7, [x0, z29.d, lsl 3]
-       incw    z30.s
+       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
+       incd    x4
+       add     z30.s, z30.s, z29.s
        whilelo p7.d, x4, x3
        b.any   .L86
 .L84:
        ret

The patch was bootstrapped and tested on aarch64-linux-gnu, no
regression.
OK for mainline?

Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com>

gcc/
	* tree-vect-stmts.cc (vectorizable_store): Extend the use of
	n_adjacent_stores to also cover vec_to_scalar operations.
	* config/aarch64/aarch64-tuning-flags.def: Remove
	use_new_vector_costs as tuning option.
	* config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
	Remove.
	(aarch64_vector_costs::add_stmt_cost): Remove use of
	aarch64_use_new_vector_costs_p.
	(aarch64_vector_costs::finish_cost): Remove use of
	aarch64_use_new_vector_costs_p.
	* config/aarch64/tuning_models/cortexx925.h: Remove
	AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
	* config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
	* config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
	* config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
	* config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
	* config/aarch64/tuning_models/neoversen2.h: Likewise.
	* config/aarch64/tuning_models/neoversen3.h: Likewise.
	* config/aarch64/tuning_models/neoversev1.h: Likewise.
	* config/aarch64/tuning_models/neoversev2.h: Likewise.
	* config/aarch64/tuning_models/neoversev3.h: Likewise.
	* config/aarch64/tuning_models/neoversev3ae.h: Likewise.

gcc/testsuite/
	* gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
	* gcc.target/aarch64/sve/strided_store_2.c: Likewise.
iains pushed a commit that referenced this pull request Feb 21, 2025
In a member-specification of a class, a noexcept-specifier is
a complete-class context.  Thus we delay parsing until the end of
the class via our DEFERRED_PARSE mechanism; see cp_parser_save_noexcept
and cp_parser_late_noexcept_specifier.
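
For example, a noexcept-specifier can refer to a member declared later
in the class (a minimal sketch, not taken from the testsuite):

    struct S {
      void f () noexcept (noexcept (x));  // x not visible yet; the
                                          // noexcept is parsed at the
                                          // end of the class
      int x;
    };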

We also attempt to defer instantiation of noexcept-specifiers in order
to reduce the number of instantiations; this is done via DEFERRED_NOEXCEPT.

We can even have both, as in noexcept65.C: a DEFERRED_PARSE wrapped in
DEFERRED_NOEXCEPT, which uses the DEFPARSE_INSTANTIATIONS mechanism.
noexcept65.C works because, by the time we really need the noexcept
(when parsing the body of S::A::A()), it has already been parsed;
noexcept-specifiers are parsed before the bodies of member functions.

But in this test we have:

  struct A {
      int x;
      template<class>
      void foo() noexcept(noexcept(x)) {}
      auto bar() -> decltype(foo<int>()) {} // #1
  };

and I think the decltype in #1 needs the noexcept-specifier before it
has had a chance to be parsed.  clang++ rejects the test and I suppose we
should reject it as well, rather than crashing on a DEFERRED_PARSE
in tsubst_expr.

	PR c++/117106
	PR c++/118190

gcc/cp/ChangeLog:

	* pt.cc (maybe_instantiate_noexcept): Give an error if the noexcept
	hasn't been parsed yet.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/noexcept89.C: New test.
	* g++.dg/cpp0x/noexcept90.C: New test.

Reviewed-by: Jason Merrill <jason@redhat.com>
NinaRanns pushed a commit that referenced this pull request Jun 26, 2025
This patch adds a new param vect-scalar-cost-multiplier to scale the scalar
costing during vectorization.  If the multiplier is set high enough, then
with the dynamic cost model it effectively disables the costing against
scalar code and assumes all vectorization to be profitable.

This is similar to using the unlimited cost model, but unlike unlimited it
does not fully disable the vector cost model.  That means we still perform
comparisons between vector modes, and we still do the costing needed for
alias analysis.

As an example, the following:

void
foo (char *restrict a, int *restrict b, int *restrict c,
     int *restrict d, int stride)
{
    if (stride <= 1)
        return;

    for (int i = 0; i < 3; i++)
        {
            int res = c[i];
            int t = b[i * stride];
            if (a[i] != 0)
                res = t * d[i];
            c[i] = res;
        }
}

compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
vectorize because the model assumes scalar code would be faster, while with
-fvect-cost-model=unlimited it picks a vector type so big that the generated
sequence works mostly on inactive lanes:

        ...
        and     p3.b, p3/z, p4.b, p4.b
        whilelo p0.s, wzr, w7
        ld1w    z23.s, p3/z, [x3, #3, mul vl]
        ld1w    z28.s, p0/z, [x5, z31.s, sxtw 2]
        add     x0, x5, x0
        punpklo p6.h, p6.b
        ld1w    z27.s, p4/z, [x0, z31.s, sxtw 2]
        and     p6.b, p6/z, p0.b, p0.b
        punpklo p4.h, p7.b
        ld1w    z24.s, p6/z, [x3, #2, mul vl]
        and     p4.b, p4/z, p2.b, p2.b
        uqdecw  w6
        ld1w    z26.s, p4/z, [x3]
        whilelo p1.s, wzr, w6
        mul     z27.s, p5/m, z27.s, z23.s
        ld1w    z29.s, p1/z, [x4, z31.s, sxtw 2]
        punpkhi p7.h, p7.b
        mul     z24.s, p5/m, z24.s, z28.s
        and     p7.b, p7/z, p1.b, p1.b
        mul     z26.s, p5/m, z26.s, z30.s
        ld1w    z25.s, p7/z, [x3, #1, mul vl]
        st1w    z27.s, p3, [x2, #3, mul vl]
        mul     z25.s, p5/m, z25.s, z29.s
        st1w    z24.s, p6, [x2, #2, mul vl]
        st1w    z25.s, p7, [x2, #1, mul vl]
        st1w    z26.s, p4, [x2]
        ...

With -fvect-cost-model=dynamic --param vect-scalar-cost-multiplier=200
you get more reasonable code:

foo:
        cmp     w4, 1
        ble     .L1
        ptrue   p7.s, vl3
        index   z0.s, #0, w4
        ld1b    z29.s, p7/z, [x0]
        ld1w    z30.s, p7/z, [x1, z0.s, sxtw 2]
        ptrue   p6.b, all
        cmpne   p7.b, p7/z, z29.b, #0
        ld1w    z31.s, p7/z, [x3]
        mul     z31.s, p6/m, z31.s, z30.s
        st1w    z31.s, p7, [x2]
.L1:
        ret

This option has been useful internally for performance exploration and
cost-model validation.  It allows us to force realistic vectorization while
overriding the cost model, so we can tell whether the model is correct with
respect to profitability.
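
For reference, the full invocation used to produce the second sequence
above would look like this (assuming the function lives in foo.c; -S
just emits the assembly):

    gcc -O3 -march=armv8-a+sve -fvect-cost-model=dynamic \
        --param vect-scalar-cost-multiplier=200 -S foo.c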

gcc/ChangeLog:

	* params.opt (vect-scalar-cost-multiplier): New.
	* tree-vect-loop.cc (vect_estimate_min_profitable_iters): Use it.
	* doc/invoke.texi (vect-scalar-cost-multiplier): Document it.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/cost_model_16.c: New test.
NinaRanns pushed a commit that referenced this pull request Jul 14, 2025
When using SVE INDEX to load an Advanced SIMD vector, we need to
take account of the different element ordering for big-endian
targets.  For example, when big-endian targets store the V4SI
constant { 0, 1, 2, 3 } in registers, 0 becomes the most
significant element, whereas INDEX always operates from the
least significant element.  A big-endian target would therefore
load V4SI { 0, 1, 2, 3 } using:

    INDEX Z0.S, #3, #-1

rather than little-endian's:

    INDEX Z0.S, #0, #1
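
As a standalone illustration (a sketch using the GNU vector extension,
not taken from the testsuite):

    typedef int v4si __attribute__ ((vector_size (16)));

    v4si
    make_series (void)
    {
      /* An SVE target can materialize this constant with INDEX; on
         big-endian it needs the reversed series shown above.  */
      return (v4si) { 0, 1, 2, 3 };
    }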

While there, I noticed that we would only check the first vector
in a multi-vector SVE constant, which would trigger an ICE if the
other vectors turned out to be invalid.  This is pretty difficult to
trigger at the moment, since we only allow single-register modes to be
used as frontend & middle-end vector modes, but it can be seen using
the RTL frontend.

gcc/
	* config/aarch64/aarch64.cc (aarch64_sve_index_series_p): New
	function, split out from...
	(aarch64_simd_valid_imm): ...here.  Account for the different
	SVE and Advanced SIMD element orders on big-endian targets.
	Check each vector in a structure mode.

gcc/testsuite/
	* gcc.dg/rtl/aarch64/vec-series-1.c: New test.
	* gcc.dg/rtl/aarch64/vec-series-2.c: Likewise.
	* gcc.target/aarch64/sve/acle/general/dupq_2.c: Fix expected
	output for this big-endian test.
	* gcc.target/aarch64/sve/acle/general/dupq_4.c: Likewise.
	* gcc.target/aarch64/sve/vec_init_3.c: Restrict to little-endian
	targets and add more tests.
	* gcc.target/aarch64/sve/vec_init_4.c: New big-endian version
	of vec_init_3.c.
NinaRanns pushed a commit that referenced this pull request Jul 22, 2025
This was failing for two reasons:

1) We were wrongly treating the basic_string constructor as
zero-initializing the object, which it does not do.
2) Given that, when we went to look for a value for the anonymous union,
we concluded that it was value-initialized, and trying to evaluate that
broke because we weren't setting ctx->ctor for it.

This patch fixes both issues, #1 by setting CONSTRUCTOR_NO_CLEARING and #2
by inserting a new CONSTRUCTOR for the member rather than evaluating it out
of context, which is consistent with cxx_eval_store_expression.
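
A hypothetical reduction in the spirit of the new test (the actual
constexpr-union9.C is not reproduced here, so the details are an
assumption; compile as C++14 or later):

    struct S {
      union { int i; char c; };   // anonymous union member
      constexpr S () : i (1) {}   // does not zero-initialize the object
    };

    constexpr int
    f ()
    {
      S s;         // initial value must use CONSTRUCTOR_NO_CLEARING
      return s.i;  // reads the anonymous union's active member
    }

    static_assert (f () == 1, "");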

	PR c++/120577

gcc/cp/ChangeLog:

	* constexpr.cc (cxx_eval_call_expression): Set
	CONSTRUCTOR_NO_CLEARING on initial value for ctor.
	(cxx_eval_component_reference): Make value-initialization
	of an aggregate member explicit.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp2a/constexpr-union9.C: New test.