
Commit 5e17b8e

Nick Child authored and gregkh committed
ibmvnic: Ensure login failure recovery is safe from other resets
commit 6db541a upstream.

If a login request fails, the recovery process should be protected
against parallel resets. It is a known issue that freeing and
registering CRQs in quick succession can result in a failover CRQ from
the VIOS. Processing a failover during login recovery is dangerous for
two reasons:
 1. This will result in two parallel initialization processes, which can
    cause serious issues during login.
 2. It is possible that the failover CRQ is received but never executed.
    We get notified of a pending failover through a transport event CRQ.
    The reset is not performed until an INIT CRQ request is received.
    Previously, if CRQ init failed during login recovery, the ibmvnic
    irq was freed and the login process returned an error. If
    failover_pending is true (a transport event was received), then the
    ibmvnic device would never be able to process the reset since it
    cannot receive the CRQ_INIT request due to the irq being freed. This
    left the device in an inoperable state.

Therefore, the login failure recovery process must be hardened against
these possible issues. Possible failovers (due to quick CRQ free and
init) must be avoided and any issues during re-initialization should be
dealt with instead of being propagated up the stack. This logic is
similar to that of ibmvnic_probe().

Fixes: dff515a ("ibmvnic: Harden device login requests")
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230809221038.51296-5-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
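The second hazard in the commit message (a failover that is recorded but can never run) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct model_adapter` and every `model_*` function are hypothetical names invented for illustration, not part of the ibmvnic driver, and CRQ delivery is reduced to a single boolean.

```c
#include <stdbool.h>

/* Hypothetical, simplified model; NOT the driver's struct ibmvnic_adapter. */
struct model_adapter {
	bool irq_registered;   /* can CRQs still be delivered to us? */
	bool failover_pending; /* set when a transport event CRQ arrives */
	bool reset_done;       /* set once the failover reset has run */
};

/* A transport event only records that a failover is pending. */
static void model_transport_event(struct model_adapter *a)
{
	if (a->irq_registered)
		a->failover_pending = true;
}

/* The reset itself runs only when an INIT CRQ is delivered. */
static void model_init_crq(struct model_adapter *a)
{
	if (!a->irq_registered)
		return; /* irq freed: the CRQ never reaches the driver */
	if (a->failover_pending) {
		a->failover_pending = false;
		a->reset_done = true;
	}
}

/* Pre-patch behaviour on a failed CRQ init: free the irq and give up. */
static void model_old_recovery_failure(struct model_adapter *a)
{
	a->irq_registered = false;
}

/* Stuck: a failover is pending but nothing can trigger it any more. */
static bool model_is_stuck(const struct model_adapter *a)
{
	return a->failover_pending && !a->irq_registered;
}
```

Calling `model_transport_event()`, then `model_old_recovery_failure()`, then `model_init_crq()` on an adapter that starts with `irq_registered = true` leaves `model_is_stuck()` true and `reset_done` false, which mirrors the inoperable state described above.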
1 parent 206ccf4 commit 5e17b8e


1 file changed: +47 -21 lines changed

drivers/net/ethernet/ibm/ibmvnic.c

Lines changed: 47 additions & 21 deletions
@@ -115,6 +115,7 @@ static void ibmvnic_tx_scrq_clean_buffer(struct ibmvnic_adapter *adapter,
 static void free_long_term_buff(struct ibmvnic_adapter *adapter,
 				struct ibmvnic_long_term_buff *ltb);
 static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter);
+static void flush_reset_queue(struct ibmvnic_adapter *adapter);
 
 struct ibmvnic_stat {
 	char name[ETH_GSTRING_LEN];
@@ -1316,8 +1317,8 @@ static const char *adapter_state_to_string(enum vnic_state state)
 
 static int ibmvnic_login(struct net_device *netdev)
 {
+	unsigned long flags, timeout = msecs_to_jiffies(20000);
 	struct ibmvnic_adapter *adapter = netdev_priv(netdev);
-	unsigned long timeout = msecs_to_jiffies(20000);
 	int retry_count = 0;
 	int retries = 10;
 	bool retry;
@@ -1382,6 +1383,7 @@ static int ibmvnic_login(struct net_device *netdev)
 					   "SCRQ irq initialization failed\n");
 				return rc;
 			}
+		/* Default/timeout error handling, reset and start fresh */
 		} else if (adapter->init_done_rc) {
 			netdev_warn(netdev, "Adapter login failed, init_done_rc = %d\n",
 				    adapter->init_done_rc);
@@ -1397,29 +1399,53 @@ static int ibmvnic_login(struct net_device *netdev)
 				    "Freeing and re-registering CRQs before attempting to login again\n");
 			retry = true;
 			adapter->init_done_rc = 0;
-			retry_count++;
 			release_sub_crqs(adapter, true);
-			reinit_init_done(adapter);
-			release_crq_queue(adapter);
-			/* If we don't sleep here then we risk an unnecessary
-			 * failover event from the VIOS. This is a known VIOS
-			 * issue caused by a vnic device freeing and registering
-			 * a CRQ too quickly.
+			/* Much of this is similar logic as ibmvnic_probe(),
+			 * we are essentially re-initializing communication
+			 * with the server. We really should not run any
+			 * resets/failovers here because this is already a form
+			 * of reset and we do not want parallel resets occurring
 			 */
-			msleep(1500);
-			rc = init_crq_queue(adapter);
-			if (rc) {
-				netdev_err(netdev, "login recovery: init CRQ failed %d\n",
-					   rc);
-				return -EIO;
-			}
+			do {
+				reinit_init_done(adapter);
+				/* Clear any failovers we got in the previous
+				 * pass since we are re-initializing the CRQ
+				 */
+				adapter->failover_pending = false;
+				release_crq_queue(adapter);
+				/* If we don't sleep here then we risk an
+				 * unnecessary failover event from the VIOS.
+				 * This is a known VIOS issue caused by a vnic
+				 * device freeing and registering a CRQ too
+				 * quickly.
+				 */
+				msleep(1500);
+				/* Avoid any resets, since we are currently
+				 * resetting.
+				 */
+				spin_lock_irqsave(&adapter->rwi_lock, flags);
+				flush_reset_queue(adapter);
+				spin_unlock_irqrestore(&adapter->rwi_lock,
+						       flags);
+
+				rc = init_crq_queue(adapter);
+				if (rc) {
+					netdev_err(netdev, "login recovery: init CRQ failed %d\n",
+						   rc);
+					return -EIO;
+				}
 
-			rc = ibmvnic_reset_init(adapter, false);
-			if (rc) {
-				netdev_err(netdev, "login recovery: Reset init failed %d\n",
-					   rc);
-				return -EIO;
-			}
+				rc = ibmvnic_reset_init(adapter, false);
+				if (rc)
+					netdev_err(netdev, "login recovery: Reset init failed %d\n",
+						   rc);
+				/* IBMVNIC_CRQ_INIT will return EAGAIN if it
+				 * fails, since ibmvnic_reset_init will free
+				 * irq's in failure, we won't be able to receive
+				 * new CRQs so we need to keep trying. probe()
+				 * handles this similarly.
+				 */
+			} while (rc == -EAGAIN && retry_count++ < retries);
 		}
 	} while (retry);
 
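The loop condition added in the last hunk, `rc == -EAGAIN && retry_count++ < retries`, is easy to misread, so here is a minimal user-space sketch of its counting behaviour. It assumes a hypothetical callback type `attempt_fn` standing in for one release/init pass; `retry_while_eagain`, `struct retry_result`, `always_eagain` and `succeed_after` are illustrative names, not driver functions.

```c
#include <errno.h>

/* Hypothetical stand-in for one re-initialization pass. */
typedef int (*attempt_fn)(void *ctx);

struct retry_result {
	int rc;          /* return code of the last attempt */
	int attempts;    /* how many times the loop body ran */
	int retry_count; /* counter value after the loop exits */
};

/* Same loop shape as the patch: keep trying only while rc is -EAGAIN. */
static struct retry_result retry_while_eagain(attempt_fn attempt, void *ctx,
					      int retry_count, int retries)
{
	struct retry_result res = {0};

	do {
		res.rc = attempt(ctx);
		res.attempts++;
		/* && short-circuits: the counter is only bumped on -EAGAIN */
	} while (res.rc == -EAGAIN && retry_count++ < retries);

	res.retry_count = retry_count;
	return res;
}

/* Attempt that never succeeds. */
static int always_eagain(void *ctx)
{
	(void)ctx;
	return -EAGAIN;
}

/* Attempt that returns -EAGAIN *ctx times, then succeeds. */
static int succeed_after(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -EAGAIN;
	}
	return 0;
}
```

With `retry_count` starting at 0 and `retries` set to 10, as in `ibmvnic_login()`, an attempt that always returns `-EAGAIN` runs 11 times (the first pass plus 10 retries). Any other return code, success or failure, leaves the loop at once without touching the counter. In the real function the counter is the function-level `retry_count`, so earlier increments elsewhere in `ibmvnic_login()` would reduce the number of passes available here.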
