Commit 1d3b019

Merge pull request #223 from pynbody/mysql-optimize
MySQL optimization
2 parents e56c28a + 110ed59 commit 1d3b019

21 files changed, with 764 additions and 235 deletions

docs/advanced.md

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -4,6 +4,7 @@ Advanced topics
44
_Tangos_ is a highly flexible, customisable system. Tutorials are available covering the following
55
topics:
66

7+
- Working with [different database systems](dbms.md) (e.g. MySQL and PostgreSQL)
78
- Writing code to [calculate your own properties](custom_properties.md)
89
- [Tracking](tracking.md) groups of particles across timesteps
910
- [Parallelisation strategies](mpi.md)

docs/index.md

Lines changed: 0 additions & 39 deletions
@@ -77,45 +77,6 @@ MySQL / MariaDB below.
7777
Remember, you will need to set these environment variables *every* time you start a new session on your computer prior
7878
to booting up the database, either with the webserver or the python interface (see below).
7979

80-
Using PostgreSQL, MySQL or MariaDB
81-
----------------------------------
82-
83-
As stated above, tangos is agnostic to the underlying SQL flavour. It is easiest to get started with
84-
SQLite which doesn't need any special server. But version 1.5+ should also work well with [MySQL](https://www.mysql.com),
85-
[MariaDB](https://mariadb.org) and version 1.7+ also with [PostgreSQL](https://www.postgresql.org).
86-
87-
To try this out, if you have [docker](https://docker.com), you can run a test
88-
MySQL server very easily:
89-
90-
```bash
91-
docker pull mysql
92-
docker run -d --name=mysql-server -p3306:3306 -e MYSQL_ROOT_PASSWORD=my_secret_password mysql
93-
echo "create database database_name;" | docker exec -i mysql-server mysql -pmy_secret_password
94-
```
95-
96-
Or, just as easily, you can get going with PostgreSQL:
97-
```bash
98-
docker pull postgres
99-
docker run --name tangos-postgres -e POSTGRES_USER=tangos -e POSTGRES_PASSWORD=my_secret_password -e POSTGRES_DB=database_name -p 5432:5432 -d postgres
100-
```
101-
102-
To be sure that python can connect to MySQL or PostgreSQL, install the appropriate modules:
103-
```bash
104-
pip install PyMySQL # for MySQL
105-
pip install psycopg2-binary # for PostgreSQL
106-
```
107-
108-
Tangos can now connect to your test MySQL server using the connection:
109-
```bash
110-
export TANGOS_DB_CONNECTION=mysql+pymysql://root:my_secret_password@localhost:3306/database_name
111-
```
112-
or for PostgreSQL:
113-
```bash
114-
export TANGOS_DB_CONNECTION=postgresql+psycopg2://tangos:my_secret_password@localhost/database_name
115-
```
116-
117-
You can now use all the tangos tools as normal, and they will populate the MySQL/PostgreSQL database
118-
instead of a SQLite file.
11980

12081

12182
Where next?

docs/rdbms.md

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
Working with different database systems
2+
=======================================
3+
4+
Tangos is built on sqlalchemy, which means that it is in principle possible to use any database system supported by sqlalchemy. However, different database systems have different features and limitations of which it is worth being aware.
5+
6+
The tangos tests are run with SQLite, MySQL and PostgreSQL. Other databases, while supported by sqlalchemy, have not been directly tested. The following sections contain some notes on using these different systems.
7+
8+
SQLite
9+
------
10+
11+
SQLite is the default database. It is simple in the sense that it keeps your entire database within a single file which can easily be transferred to different systems. Additionally, the SQLite driver is included with Python and so it's quick to get started.
12+
13+
There are two major, related drawbacks to SQLite. The first is that the
14+
15+
PostgreSQL and MySQL
16+
--------------------
17+
18+
PostgreSQL and MySQL are both server-based systems, and as such take a little more effort to set up and maintain. If one exposes PostgreSQL to the outside world, there are potential security implications. One can of course run it on a firewalled computer and manage access appropriately, but this takes some expertise of its own (that will not be covered here). The major advantage is that you can host your data in a single location and allow multiple users to connect.
19+
20+
21+
22+
MySQL
23+
-----
24+
25+
MySQL is a server-based system, and as such takes a little more effort to set up. The advantage is that you can host your data in a single location and allow multiple users to connect. Additionally, it is able to cope much better with complex parallel writes than SQLite.
26+
27+
For most users, MySQL and PostgreSQL are
28+
29+
To try this out, if you have [docker](https://docker.com), you can run a test
30+
MySQL server very easily:
31+
32+
```bash
33+
docker pull mysql
34+
docker run -d --name=mysql-server -p3306:3306 -e MYSQL_ROOT_PASSWORD=my_secret_password mysql
35+
echo "create database database_name;" | docker exec -i mysql-server mysql -pmy_secret_password
36+
```
37+
38+
Or, just as easily, you can get going with PostgreSQL:
39+
```bash
40+
docker pull postgres
41+
docker run --name tangos-postgres -e POSTGRES_USER=tangos -e POSTGRES_PASSWORD=my_secret_password -e POSTGRES_DB=database_name -p 5432:5432 -d postgres
42+
```
43+
44+
To be sure that python can connect to MySQL or PostgreSQL, install the appropriate modules:
45+
```bash
46+
pip install PyMySQL # for MySQL
47+
pip install psycopg2-binary # for PostgreSQL
48+
```
49+
50+
Tangos can now connect to your test MySQL server using the connection:
51+
```bash
52+
export TANGOS_DB_CONNECTION=mysql+pymysql://root:my_secret_password@localhost:3306/database_name
53+
```
54+
or for PostgreSQL:
55+
```bash
56+
export TANGOS_DB_CONNECTION=postgresql+psycopg2://tangos:my_secret_password@localhost/database_name
57+
```
58+
59+
You can now use all the tangos tools as normal, and they will populate the MySQL/PostgreSQL database
60+
instead of a SQLite file.
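
The connection strings above follow the URL layout `dialect+driver://user:password@host:port/database`. As a minimal sketch using only Python's standard library, the pieces can be pulled apart to sanity-check a URL before exporting it. The helper name `describe_connection` is hypothetical and is not part of tangos or sqlalchemy.

```python
from urllib.parse import urlsplit


def describe_connection(url):
    """Split a database URL into its parts (illustrative helper, not a tangos API)."""
    parts = urlsplit(url)
    # The scheme has the form "dialect+driver", e.g. "mysql+pymysql"
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL leaves the port out
        "database": parts.path.lstrip("/"),
    }


mysql_info = describe_connection(
    "mysql+pymysql://root:my_secret_password@localhost:3306/database_name"
)
postgres_info = describe_connection(
    "postgresql+psycopg2://tangos:my_secret_password@localhost/database_name"
)
print(mysql_info)
print(postgres_info)
```

The PostgreSQL URL omits the port, so the client falls back to its own default; the MySQL URL spells out 3306 to match the `-p3306:3306` mapping in the docker command.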

setup.py

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,8 @@
2121
'hupper',
2222
'scipy >= 0.14.0',
2323
'more_itertools >= 8.0.0',
24-
'matplotlib >= 3.0.0' # for web interface
24+
'matplotlib >= 3.0.0', # for web interface
25+
'tqdm >= 4.59.0'
2526
]
2627

2728
tests_require = [

tangos/config.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -87,6 +87,11 @@
8787
# Default rtol for assert_almost_equal when using the diff tool
8888
diff_default_rtol = 1e-3
8989

90+
91+
# Database import: how many rows to copy at a time, and when to issue a commit
92+
DB_IMPORT_CHUNK_SIZE = 10
93+
DB_IMPORT_COMMIT_AFTER_CHUNKS = 500
94+
9095
try:
9196
from .config_local import *
9297
except:
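
This hunk only adds the two settings; the import code that reads them is not shown here. As a rough sketch of the pattern the names suggest (copy a fixed number of rows per chunk, commit once a fixed number of chunks has accumulated), the following uses the standard-library `sqlite3` module. The `items` table and the `copy_rows` function are hypothetical and do not reflect tangos's actual import routine.

```python
import sqlite3

DB_IMPORT_CHUNK_SIZE = 10
DB_IMPORT_COMMIT_AFTER_CHUNKS = 500


def copy_rows(source, destination, chunk_size=DB_IMPORT_CHUNK_SIZE,
              commit_after_chunks=DB_IMPORT_COMMIT_AFTER_CHUNKS):
    """Copy every row of the hypothetical `items` table in chunks; return the commit count."""
    cursor = source.execute("SELECT id, value FROM items ORDER BY id")
    chunks_since_commit = 0
    num_commits = 0
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        destination.executemany("INSERT INTO items (id, value) VALUES (?, ?)", chunk)
        chunks_since_commit += 1
        if chunks_since_commit == commit_after_chunks:
            destination.commit()
            num_commits += 1
            chunks_since_commit = 0
    if chunks_since_commit:
        destination.commit()  # flush the final partial batch
        num_commits += 1
    return num_commits


source = sqlite3.connect(":memory:")
destination = sqlite3.connect(":memory:")
for connection in (source, destination):
    connection.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value TEXT)")
source.executemany("INSERT INTO items VALUES (?, ?)", [(i, f"row {i}") for i in range(95)])
source.commit()

# Small settings so the batching is visible: chunks of up to 10 rows, commit every 4 chunks
num_commits = copy_rows(source, destination, chunk_size=10, commit_after_chunks=4)
copied = destination.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(copied, num_commits)
```

Under this reading, the defaults would commit once every 10 × 500 = 5000 rows, which keeps individual transactions bounded without paying the cost of a commit per chunk.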

tangos/core/__init__.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -68,6 +68,7 @@ def get_default_engine() -> sqlalchemy.engine.Engine:
6868
Index("halo_finder_index", SimulationObjectBase.__table__.c.finder_id)
6969
Index("haloproperties_creator_index", HaloProperty.__table__.c.creator_id)
7070
Index("halolink_index", HaloLink.__table__.c.halo_from_id)
71+
Index("halolink_bidirectional_index", HaloLink.__table__.c.halo_to_id, HaloLink.__table__.c.halo_from_id)
7172
Index("named_halolink_index", HaloLink.__table__.c.relation_id, HaloLink.__table__.c.halo_from_id)
7273

7374
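
The new `halolink_bidirectional_index` leads with `halo_to_id`, so lookups of links pointing *to* a halo no longer depend on an index keyed by `halo_from_id`. As a hedged illustration, the sketch below builds a cut-down stand-in for the table in an in-memory SQLite database and asks for the query plan. The table definition is a simplification made for this example, and other database systems may plan the same query differently.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Cut-down stand-in for the halolink table: only the columns that the indexes use
connection.execute(
    "CREATE TABLE halolink (id INTEGER PRIMARY KEY, halo_from_id INTEGER, "
    "halo_to_id INTEGER, relation_id INTEGER)"
)
connection.execute("CREATE INDEX halolink_index ON halolink (halo_from_id)")
connection.execute(
    "CREATE INDEX halolink_bidirectional_index ON halolink (halo_to_id, halo_from_id)"
)

# Ask SQLite how it would find the links that point *to* a given halo
plan_rows = connection.execute(
    "EXPLAIN QUERY PLAN SELECT halo_from_id FROM halolink WHERE halo_to_id = ?", (42,)
).fetchall()
# The human-readable description is the last column of each plan row
plan = " ".join(str(row[-1]) for row in plan_rows)
print(plan)
```

Because the index holds both columns, this particular query can be answered from the index alone without touching the table rows.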

tangos/core/halo.py

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
import numpy as np
2-
from sqlalchemy import Column, ForeignKey, Integer, orm, types
2+
from sqlalchemy import BigInteger, Column, ForeignKey, Integer, orm, types
33
from sqlalchemy.orm import Session, backref, relationship
44

55
from . import Base, creator, extraction_patterns
@@ -30,9 +30,9 @@ class SimulationObjectBase(Base):
3030
__tablename__= "halos"
3131

3232
id = Column(Integer, primary_key=True) #the unique ID value of the database object created for this halo
33-
halo_number = Column(Integer) #by default this will be the halo's rank in terms of particle count
34-
finder_id = Column(UnsignedInteger) #raw halo ID from the halo catalog
35-
finder_offset = Column(Integer) #index of halo within halo catalog, primary identifier used when reading catalog/simulation data
33+
halo_number = Column(BigInteger) #by default this will be the halo's rank in terms of particle count
34+
finder_id = Column(BigInteger) #raw halo ID from the halo catalog
35+
finder_offset = Column(BigInteger) #index of halo within halo catalog, primary identifier used when reading catalog/simulation data
3636
timestep_id = Column(Integer, ForeignKey('timesteps.id'))
3737
timestep = relationship(TimeStep, backref=backref(
3838
'objects', order_by=halo_number, cascade_backrefs=False, lazy='dynamic'), cascade='')
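
The commit does not state why these three columns move to `BigInteger`, but a plausible reason is range: in MySQL and PostgreSQL an `INTEGER` column is a signed 32-bit value, while `BIGINT` is signed 64-bit. The sketch below checks which of the two a value needs. The function name and the sample IDs are made up for illustration.

```python
INT32_MAX = 2**31 - 1   # largest value of a signed 32-bit INTEGER column
INT64_MAX = 2**63 - 1   # largest value of a signed 64-bit BIGINT column


def smallest_signed_column(value):
    """Return the narrowest signed integer column type that can hold `value`."""
    if -INT32_MAX - 1 <= value <= INT32_MAX:
        return "INTEGER"
    if -INT64_MAX - 1 <= value <= INT64_MAX:
        return "BIGINT"
    raise OverflowError(f"{value} does not fit in a 64-bit column")


# Hypothetical halo-finder IDs, chosen only to straddle the 32-bit limit
for finder_id in (12345, 2**31 - 1, 2**31, 10**12):
    print(finder_id, smallest_signed_column(finder_id))
```

SQLite stores integers in up to 64 bits regardless of the declared type, so a limit of this kind would only surface after moving to a server-based system.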

tangos/parallel_tasks/testing.py

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
from .message import Message
2+
3+
FILENAME = "parallel_tasks_test_log.txt"
4+
5+
def initialise_log():
6+
with open(FILENAME, "w") as f:
7+
f.write("")
8+
9+
def get_log():
10+
with open(FILENAME) as f:
11+
return f.readlines()
12+
13+
def log(message):
14+
ServerLogMessage(message).send(0)
15+
16+
class ServerLogMessage(Message):
17+
def process(self):
18+
with open(FILENAME, "a") as f:
19+
f.write(f"[{self.source:d}] {self.contents:s}\r\n")
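
In the module above, `log` sends the message to rank 0, and `ServerLogMessage.process` appends one `[source] contents` line to the shared file. The stand-alone sketch below reproduces only that line format, without the `Message` machinery, so the resulting log can be inspected. The function `append_log_line` and the sample messages are hypothetical.

```python
import os
import tempfile


def append_log_line(filename, source, contents):
    """Append one entry in the same "[source] contents" format as ServerLogMessage.process."""
    with open(filename, "a") as f:
        f.write(f"[{source:d}] {contents:s}\r\n")


with tempfile.TemporaryDirectory() as directory:
    filename = os.path.join(directory, "parallel_tasks_test_log.txt")
    append_log_line(filename, 1, "first message")
    append_log_line(filename, 2, "second message")
    with open(filename) as f:
        lines = f.readlines()

# Strip line endings so the entries compare cleanly across platforms
entries = [line.strip() for line in lines if line.strip()]
print(entries)
```

The `:d` and `:s` format specifiers mean the source must be an integer and the contents a string; anything else raises an error at write time.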

tangos/relation_finding/multi_hop.py

Lines changed: 22 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,8 @@
1313

1414
from .. import config, core, temporary_halolist
1515
from ..config import DOUBLE_PRECISION
16+
from ..log import logger
17+
from ..util.timing_monitor import TimingMonitor
1618
from .one_hop import HopStrategy
1719

1820

@@ -89,6 +91,7 @@ def __init__(self, halo_from, nhops_max=None, directed=None, target=None,
8991
self._combine_routes = combine_routes
9092
self._debug_output = False # set to True to see information about discovered links as hops progress
9193

94+
self.timing_monitor = TimingMonitor()
9295
def temp_table(self):
9396
"""Execute the strategy and return results as a temp_table (see temporary_halolist module)"""
9497
if self._all is None:
@@ -119,7 +122,10 @@ def _execute_query(self):
119122
with self._manage_temp_table():
120123
self._generate_multihop_results()
121124
try:
122-
results = self._order_query(self._generate_query(halo_ids_only=False)).all()
125+
q = self._order_query(self._generate_query(halo_ids_only=False))
126+
results = q.all()
127+
# NB the time here seems to be mainly making the ORM objects
128+
# rather than the query itself
123129
except sqlalchemy.exc.ResourceClosedError:
124130
results = []
125131

@@ -230,6 +236,7 @@ def _create_temp_table(self):
230236
)
231237

232238
self._table_index = Index('temp.source_id_index_' + rstr, multi_hop_link_table.c.source_id, multi_hop_link_table.c.nhops)
239+
self._table_nhop_index = Index('temp.nhop_index_' + rstr, multi_hop_link_table.c.nhops)
233240

234241
self._table = multi_hop_link_table
235242
self._prelim_table = multi_hop_link_prelim_table
@@ -252,12 +259,15 @@ def _seed_temp_table(self):
252259

253260
def _make_hops(self):
254261
for i in range(0, self.nhops_max):
255-
self._nhops_taken = i
256-
generated_count = self._generate_next_level_prelim_links(i)
257-
if generated_count != 0:
258-
filtered_count = self._filter_prelim_links_into_final()
259-
else:
260-
filtered_count = 0
262+
with self.timing_monitor(self):
263+
self._nhops_taken = i
264+
generated_count = self._generate_next_level_prelim_links(i)
265+
if generated_count != 0:
266+
filtered_count = self._filter_prelim_links_into_final()
267+
else:
268+
filtered_count = 0
269+
270+
# for performance info: self.timing_monitor.summarise_timing(logger)
261271

262272
if self._hopping_finished(filtered_count):
263273
break
@@ -267,7 +277,7 @@ def _hopping_finished(self, filtered_count):
267277
return filtered_count==0
268278

269279
def _generate_next_level_prelim_links(self, from_nhops=0):
270-
280+
self.timing_monitor.mark('prelim-insert')
271281
new_weight = self._table.c.weight * core.halo_data.HaloLink.weight
272282

273283
recursion_query = \
@@ -289,6 +299,7 @@ def _generate_next_level_prelim_links(self, from_nhops=0):
289299

290300
num_inserted = self._connection.execute(insert).rowcount
291301

302+
self.timing_monitor.mark('prelim-thin')
292303
if self._combine_routes:
293304
# Ideally, before self._prelim_table.insert(), one would adapt recursion_query to return the argmax of
294305
# new_weight, grouped by halo_to_id and source_id. That could have been achieved using:
@@ -327,6 +338,7 @@ def _debug_print_links(self, tab):
327338
print(f"[s{row[2]} i{row[4]}] {row[0].path} -> {row[1].path} w={row[3]:.2f}")
328339

329340
def _filter_prelim_links_into_final(self):
341+
self.timing_monitor.mark('final-insert')
330342
if self._debug_output:
331343
print()
332344
print(f"[{self._nhops_taken}] Preliminary links:")
@@ -346,9 +358,10 @@ def _filter_prelim_links_into_final(self):
346358
q)).rowcount
347359

348360
if self._debug_output:
349-
print(f"[{self._nhops_taken}] Accepted links:")
361+
logger.info(f"[{self._nhops_taken}] Accepted links:")
350362
self._debug_print_links(self._table)
351363

364+
self.timing_monitor.mark('final-thin')
352365
self._connection.execute(self._prelim_table.delete())
353366
return added_rows
354367
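
The hunks above wrap each hop in `with self.timing_monitor(self):` and call `mark(...)` at the start of each phase, so that elapsed time is attributed to labels such as `prelim-insert` and `final-insert`. The implementation of `TimingMonitor` is not part of this diff. The class below is a hypothetical, simplified stand-in showing one way such labelled timing can work; its interface differs from the real one (for instance, it takes an initial label rather than the object being timed).

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SketchTimingMonitor:
    """Hypothetical stand-in for a labelled timer; not tangos's TimingMonitor."""

    def __init__(self):
        self.time_per_label = defaultdict(float)
        self._label = None
        self._started = None

    def _close_interval(self):
        # Charge the time since the last mark (or block start) to the current label
        if self._label is not None:
            self.time_per_label[self._label] += time.perf_counter() - self._started

    def mark(self, label):
        """End the current labelled interval and start a new one."""
        self._close_interval()
        self._label = label
        self._started = time.perf_counter()

    @contextmanager
    def __call__(self, initial_label="start"):
        self.mark(initial_label)
        try:
            yield self
        finally:
            self._close_interval()
            self._label = None


monitor = SketchTimingMonitor()
for hop in range(3):
    with monitor():
        monitor.mark("prelim-insert")
        time.sleep(0.01)   # stands in for inserting preliminary links
        monitor.mark("final-insert")
        time.sleep(0.005)  # stands in for filtering links into the final table

for label, seconds in monitor.time_per_label.items():
    print(f"{label}: {seconds:.3f}s")
```

Times accumulate across iterations, so after several hops the totals show which phase dominates.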
