Details
-
Type:
Story
-
Status: To Do
-
Resolution: Unresolved
-
Fix Version/s: None
-
Component/s: None
-
Labels:None
-
Story Points:2.8
-
Epic Link:
-
Team:SQuaRE
Description
These errors are becoming more frequent now and require better monitoring:
https://ci.lsst.codes/blue/organizations/jenkins/scipipe%2Fap_verify/detail/ap_verify/323/
https://ci.lsst.codes/blue/organizations/jenkins/scipipe%2Fap_verify/detail/ap_verify/336/ https://ci.lsst.codes/blue/organizations/jenkins/scipipe%2Fap_verify/detail/ap_verify/337/
It is not always reproducible and probably triggered by a new condition. API logs are not useful to track this down:
[pid: 22|app: 0|req: 185377/314425] 10.20.0.1 () {44 vars in 750 bytes} [Thu Aug 29 12:31:52 2019] POST /job => generated 151 bytes in 1861 msecs (HTTP/1.1 202) 2 headers in 78 bytes (1 switches on core 1)
|
[pid: 21|app: 0|req: 94227/314427] 10.20.0.1 () {42 vars in 552 bytes} [Thu Aug 29 12:31:57 2019] POST /auth => generated 188 bytes in 55 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 1)
|
[2019-08-29 12:31:59,132] ERROR in app: Exception on /job [POST]
|
[pid: 22|app: 0|req: 185378/314428] 10.20.0.1 () {44 vars in 750 bytes} [Thu Aug 29 12:31:57 2019] POST /job => generated 37 bytes in 1344 msecs (HTTP/1.1 500) 2 headers in 90 bytes (1 switches on core 0)
|
[pid: 22|app: 0|req: 185380/314432] 10.20.0.1 () {42 vars in 552 bytes} [Thu Aug 29 12:36:31 2019] POST /auth => generated 188 bytes in 56 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 0)
|
[pid: 22|app: 0|req: 185381/314433] 10.20.0.1 () {44 vars in 750 bytes} [Thu Aug 29 12:36:31 2019] POST /job => generated 151 bytes in 1497 msecs (HTTP/1.1 202) 2 headers in 78 bytes (1 switches on core 1)
|
Attachments
Activity
Another situation where a 500 error occurs is during the pod initialization. Sometimes the cloud sql proxy container is not ready when the API container starts. In this case it just needs to retry.
$ kubectl logs squash-restful-api-7c65bcb577-dz2zb -c api -nsquash-prod | more
|
[uWSGI] getting INI configuration from uwsgi.ini
|
*** Starting uWSGI 2.0.18 (64bit) on [Sat Aug 1 00:59:52 2020] ***
|
compiled with version: 6.3.0 20170516 on 29 March 2019 15:15:08
|
os: Linux-4.19.112+ #1 SMP Tue Apr 28 06:48:29 PDT 2020
|
nodename: squash-restful-api-7c65bcb577-dz2zb
|
machine: x86_64
|
clock source: unix
|
detected number of CPU cores: 4
|
current working directory: /opt/squash
|
detected binary path: /usr/local/bin/uwsgi
|
!!! no internal routing support, rebuild with pcre support !!!
|
*** WARNING: you are running uWSGI without its master process manager ***
|
your memory page size is 4096 bytes
|
detected max file descriptor number: 1048576
|
lock engine: pthread robust mutexes
|
thunder lock: disabled (you can enable it with --thunder-lock)
|
uwsgi socket 0 bound to TCP address 127.0.0.1:5000 fd 3
|
Python version: 3.6.8 (default, Mar 27 2019, 08:53:45) [GCC 6.3.0 20170516]
|
Python main interpreter initialized at 0x55f8d3f9aa20
|
python threads support enabled
|
your server socket listen backlog is limited to 100 connections
|
your mercy for graceful operations on workers is 60 seconds
|
mapped 333376 bytes (325 KB) for 8 cores
|
*** Operational MODE: preforking+threaded ***
|
Traceback (most recent call last):
|
File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 583, in connect
|
**kwargs)
|
File "/usr/local/lib/python3.6/socket.py", line 724, in create_connection
|
raise err
|
File "/usr/local/lib/python3.6/socket.py", line 713, in create_connection
|
sock.connect(sa)
|
ConnectionRefusedError: [Errno 111] Connection refused
|
|
During handling of the above exception, another exception occurred:
|
|
Traceback (most recent call last):
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2262, in _wrap_pool_connect
|
return fn()
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 354, in connect
|
return _ConnectionFairy._checkout(self)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 751, in _checkout
|
fairy = _ConnectionRecord.checkout(pool)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 483, in checkout
|
rec = pool._do_get()
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 138, in _do_get
|
self._dec_overflow()
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
|
compat.reraise(exc_type, exc_value, exc_tb)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
|
raise value
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 135, in _do_get
|
return self._create_connection()
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 299, in _create_connection
|
return _ConnectionRecord(self)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 428, in __init__
|
self.__connect(first_connect_check=True)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 630, in __connect
|
connection = pool._invoke_creator(self)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
|
return dialect.connect(*cargs, **cparams)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 453, in connect
|
return self.dbapi.connect(*cargs, **cparams)
|
File "/usr/local/lib/python3.6/site-packages/pymysql/__init__.py", line 94, in Connect
|
return Connection(*args, **kwargs)
|
File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 325, in __init__
|
self.connect()
|
File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 630, in connect
|
raise exc
|
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)")
|
|
The above exception was the direct cause of the following exception:
|
|
Traceback (most recent call last):
|
File "run.py", line 48, in <module>
|
db.create_all()
|
File "/usr/local/lib/python3.6/site-packages/flask_sqlalchemy/__init__.py", line 963, in create_all
|
self._execute_for_all_tables(app, bind, 'create_all')
|
File "/usr/local/lib/python3.6/site-packages/flask_sqlalchemy/__init__.py", line 955, in _execute_for_all_tables
|
op(bind=self.get_engine(app, bind), **extra)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/sql/schema.py", line 4287, in create_all
|
ddl.SchemaGenerator, self, checkfirst=checkfirst, tables=tables
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2032, in _run_visitor
|
with self._optional_conn_ctx_manager(connection) as conn:
|
File "/usr/local/lib/python3.6/contextlib.py", line 81, in __enter__
|
return next(self.gen)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2024, in _optional_conn_ctx_manager
|
with self._contextual_connect() as conn:
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2226, in _contextual_connect
|
self._wrap_pool_connect(self.pool.connect, None),
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2266, in _wrap_pool_connect
|
e, dialect, self
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1536, in _handle_dbapi_exception_noconnection
|
util.raise_from_cause(sqlalchemy_exception, exc_info)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
|
reraise(type(exception), exception, tb=exc_tb, cause=cause)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
|
raise value.with_traceback(tb)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2262, in _wrap_pool_connect
|
return fn()
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 354, in connect
|
return _ConnectionFairy._checkout(self)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 751, in _checkout
|
fairy = _ConnectionRecord.checkout(pool)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 483, in checkout
|
rec = pool._do_get()
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 138, in _do_get
|
self._dec_overflow()
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
|
compat.reraise(exc_type, exc_value, exc_tb)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
|
raise value
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 135, in _do_get
|
return self._create_connection()
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 299, in _create_connection
|
return _ConnectionRecord(self)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 428, in __init__
|
self.__connect(first_connect_check=True)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 630, in __connect
|
connection = pool._invoke_creator(self)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
|
return dialect.connect(*cargs, **cparams)
|
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 453, in connect
|
return self.dbapi.connect(*cargs, **cparams)
|
File "/usr/local/lib/python3.6/site-packages/pymysql/__init__.py", line 94, in Connect
|
return Connection(*args, **kwargs)
|
File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 325, in __init__
|
self.connect()
|
File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 630, in connect
|
raise exc
|
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)")
|
(Background on this error at: http://sqlalche.me/e/e3q8)
|
|
Show
Angelo Fausti
added a comment - Another situation where a 500 error occurs is during the pod initialization. Sometimes the cloud sql proxy container is not ready when the API container starts. In this case it just needs to retry.
$ kubectl logs squash-restful-api-7c65bcb577-dz2zb -c api -nsquash-prod | more
[uWSGI] getting INI configuration from uwsgi.ini
*** Starting uWSGI 2.0.18 (64bit) on [Sat Aug 1 00:59:52 2020] ***
compiled with version: 6.3.0 20170516 on 29 March 2019 15:15:08
os: Linux-4.19.112+ #1 SMP Tue Apr 28 06:48:29 PDT 2020
nodename: squash-restful-api-7c65bcb577-dz2zb
machine: x86_64
clock source: unix
detected number of CPU cores: 4
current working directory: /opt/squash
detected binary path: /usr/local/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
*** WARNING: you are running uWSGI without its master process manager ***
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address 127.0.0.1:5000 fd 3
Python version: 3.6.8 (default, Mar 27 2019, 08:53:45) [GCC 6.3.0 20170516]
Python main interpreter initialized at 0x55f8d3f9aa20
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 333376 bytes (325 KB) for 8 cores
*** Operational MODE: preforking+threaded ***
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 583, in connect
**kwargs)
File "/usr/local/lib/python3.6/socket.py", line 724, in create_connection
raise err
File "/usr/local/lib/python3.6/socket.py", line 713, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2262, in _wrap_pool_connect
return fn()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 354, in connect
return _ConnectionFairy._checkout(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 751, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 483, in checkout
rec = pool._do_get()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 138, in _do_get
self._dec_overflow()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 135, in _do_get
return self._create_connection()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 299, in _create_connection
return _ConnectionRecord(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 428, in __init__
self.__connect(first_connect_check=True)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 630, in __connect
connection = pool._invoke_creator(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
return dialect.connect(*cargs, **cparams)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 453, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/usr/local/lib/python3.6/site-packages/pymysql/__init__.py", line 94, in Connect
return Connection(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 325, in __init__
self.connect()
File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 630, in connect
raise exc
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)")
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "run.py", line 48, in <module>
db.create_all()
File "/usr/local/lib/python3.6/site-packages/flask_sqlalchemy/__init__.py", line 963, in create_all
self._execute_for_all_tables(app, bind, 'create_all')
File "/usr/local/lib/python3.6/site-packages/flask_sqlalchemy/__init__.py", line 955, in _execute_for_all_tables
op(bind=self.get_engine(app, bind), **extra)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/sql/schema.py", line 4287, in create_all
ddl.SchemaGenerator, self, checkfirst=checkfirst, tables=tables
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2032, in _run_visitor
with self._optional_conn_ctx_manager(connection) as conn:
File "/usr/local/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2024, in _optional_conn_ctx_manager
with self._contextual_connect() as conn:
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2226, in _contextual_connect
self._wrap_pool_connect(self.pool.connect, None),
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2266, in _wrap_pool_connect
e, dialect, self
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1536, in _handle_dbapi_exception_noconnection
util.raise_from_cause(sqlalchemy_exception, exc_info)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2262, in _wrap_pool_connect
return fn()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 354, in connect
return _ConnectionFairy._checkout(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 751, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 483, in checkout
rec = pool._do_get()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 138, in _do_get
self._dec_overflow()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/impl.py", line 135, in _do_get
return self._create_connection()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 299, in _create_connection
return _ConnectionRecord(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 428, in __init__
self.__connect(first_connect_check=True)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool/base.py", line 630, in __connect
connection = pool._invoke_creator(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
return dialect.connect(*cargs, **cparams)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 453, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/usr/local/lib/python3.6/site-packages/pymysql/__init__.py", line 94, in Connect
return Connection(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 325, in __init__
self.connect()
File "/usr/local/lib/python3.6/site-packages/pymysql/connections.py", line 630, in connect
raise exc
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 111] Connection refused)")
(Background on this error at: http://sqlalche.me/e/e3q8)
On all three failed runs, Jenkins re-runs ap_verify and re-uploads the (new) results. Some runs have multiple copies of data in SQuaSH (most obvious with ap_association.totalUnassociatedDiaObjects, which should have exactly one value per run), while others do not.
All three runs upload ap_verify..verify.json (which contains totalUnassociatedDiaObjects) first, then try but fail to upload ap_verify.date2015-02-18_filterg_visit411371_ccdnum56_objectBlind15A_42.verify.json, so it's not clear why the duplication is inconsistent. Figuring that out might help track the exception down.