Is ConnectorX better than pandas in MSSQL？ #200

cufewxy · 2021-12-18T16:58:29Z

cufewxy
Dec 18, 2021

Hey, I have been looking for a better alternative to pandas.read_sql, and connectorx feels almost there.
When I tried it with MySQL, connectorx performed better than pandas. But with MSSQL, connectorx takes about 2x as long as pandas (although connectorx's memory consumption is still lower).

The query is very simple: it filters a table on one column being greater than a value. In pandas, I use pd.read_sql(xxx, engine), where engine is defined as sqlalchemy.create_engine('mssql+pymssql:xxx'). I also tried pyodbc, and connectorx is again slower than pandas.
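For anyone wanting to repeat this comparison, a minimal timing sketch might look like the following. The helper names `timed` and `compare_readers` are hypothetical, and both connection strings are left as parameters because the ones used above are elided; the sketch only assumes the usual `pd.read_sql(query, engine)` and `cx.read_sql(conn, query)` call shapes.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_readers(query: str, sqlalchemy_url: str, connectorx_url: str) -> Dict[str, float]:
    """Time the same query through pandas and through connectorx.

    Hypothetical helper: the URLs must be supplied by the caller, and the
    third-party imports are deferred so the timing helper works on its own.
    """
    import connectorx as cx
    import pandas as pd
    import sqlalchemy

    engine = sqlalchemy.create_engine(sqlalchemy_url)
    _, pandas_seconds = timed(pd.read_sql, query, engine)
    _, cx_seconds = timed(cx.read_sql, connectorx_url, query)
    return {"pandas": pandas_seconds, "connectorx": cx_seconds}
```

Running each reader several times and discarding the first run would give a fairer picture, since server-side caching can change the timings between runs.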

cufewxy · 2021-12-18T17:22:26Z

cufewxy
Dec 18, 2021
Author

When I enable debug mode by setting os.environ, I find that row counting is time-consuming (34s). After row counting, the writing stage takes 37s. The total time is 74s; pandas takes 36s.

When I set return_type="arrow", row counting is not needed, but writing takes 60s.
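A rough sketch of the setup described above, with assumptions: the post does not say which environment variable was set, so `RUST_LOG` with the value below is my guess at the usual way to get connectorx's Rust-level debug logs, and `read_as_arrow` is a hypothetical wrapper name. The variable probably needs to be set before connectorx is imported.

```python
import os

# Assumption: connectorx's debug logs are controlled through RUST_LOG.
# Set it before importing connectorx so the logger can pick it up.
os.environ["RUST_LOG"] = "connectorx=debug,connectorx_python=debug"


def read_as_arrow(conn_url: str, query: str):
    """Fetch a query result as an Arrow table instead of a pandas DataFrame.

    Hypothetical wrapper; conn_url is supplied by the caller. The import is
    deferred so that RUST_LOG is already set when connectorx loads.
    """
    import connectorx as cx

    return cx.read_sql(conn_url, query, return_type="arrow")
```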


cufewxy · 2021-12-19T04:19:24Z

cufewxy
Dec 19, 2021
Author

https://github.com/sfu-db/connector-x/discussions/156 Oh I see... In MSSQL the query is executed twice. Will you optimize getting metadata in MSSQL?


wangxiaoying · 2021-12-20T07:38:50Z

wangxiaoying
Dec 20, 2021
Maintainer

Hi @cufewxy, thanks for the example and the logs! In our benchmark, connectorx can take 5x less time than pandas. We directly select the TPC-H lineitem table, which has 60M rows and 16 columns, using 4 partitions. We also tested connectorx using 1 partition, and it outperforms pandas by >3x.

connectorx targets the scenario of fetching large query results. It speeds up the process by optimizing client-side execution and by saturating both network and machine resources through parallelism. However, when query execution on the database server is the bottleneck (e.g. the query execution time is long but the result is relatively small), the improvement ConnectorX can provide becomes minor, and sometimes it can even be slower than pandas due to the overhead of fetching metadata.
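To illustrate the parallelism idea, here is a small sketch of how a query could be split into range-bounded subqueries on a numeric column. This is only an illustration of the concept, not connectorx's actual implementation; `range_partition` and the subquery alias `t` are hypothetical names.

```python
from typing import List


def range_partition(query: str, column: str, lo: int, hi: int, num: int) -> List[str]:
    """Split the inclusive key range [lo, hi] into up to `num` contiguous
    half-open ranges and return one subquery per range."""
    if num < 1:
        raise ValueError("num must be at least 1")
    size = hi - lo + 1
    # num + 1 boundaries; the last one is hi + 1 so that hi itself is covered.
    bounds = [lo + (size * i) // num for i in range(num + 1)]
    queries = []
    for start, end in zip(bounds, bounds[1:]):
        if start == end:
            continue  # empty range, which happens when num exceeds the range size
        queries.append(
            f"SELECT * FROM ({query}) AS t "
            f"WHERE {column} >= {start} AND {column} < {end}"
        )
    return queries
```

With connectorx itself there is no need to build these by hand for integer columns: passing `partition_on` and `partition_num` to `cx.read_sql` asks the library to do the partitioning, for example `cx.read_sql(conn, query, partition_on="l_orderkey", partition_num=4)`.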

It is interesting to see that in your case, although the query is not very complex, the performance is worse than pandas. We can try to reproduce your case in our environment and see whether we can improve it further. It would be great if you could share more detailed information about the query with us:

- the schema of the table
- the number of rows in the entire table
- what the predicate looks like
- whether the filtering column has an index

> Will you optimize getting metadata in MSSQL?

We are currently trying to find a more efficient way to get the schema in order to speed up the process for MSSQL. But I think in your case it is not the main issue (in figure 2 the schema fetching procedure only takes 4 seconds).

4 replies

cufewxy Dec 20, 2021
Author

Thanks for replying. I tried again and connectorX performs better this time; the total time is much shorter than yesterday (for connectorX, 8s vs. 74s)...
It seems that the database was not running well yesterday... Thanks soooo much!

I mainly do financial analysis, and this table records stock price time series. But neither the code column nor the date column is an integer, so there's no appropriate column for partitioning. Will you support varchar column to be partitioned in the future?

wangxiaoying Dec 20, 2021
Maintainer

Oh, that's great!

> Will you support varchar column to be partitioned in the future?

We don't have a plan for this for now. We are still collecting needs for this, as well as for more automatic ways to do query partitioning. But for non-numerical columns, a workaround is to manually partition the query. For example, if the column is a date:

import connectorx as cx

# url is the database connection string
q1 = "select * from test where date < '2021-01-01'"
q2 = "select * from test where date >= '2021-01-01'"
df = cx.read_sql(url, [q1, q2])

connectorx will return a single dataframe combining the results of the two queries.
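The same workaround can be generalized to more than two partitions. The sketch below is one possible way to build the query list from a set of date cut-off points; `split_by_cutoffs` is a hypothetical helper, and it assumes the base query has no WHERE clause of its own and that the dates are ISO-formatted strings (so they sort correctly as text).

```python
from typing import List, Sequence


def split_by_cutoffs(base_query: str, column: str, cutoffs: Sequence[str]) -> List[str]:
    """Build non-overlapping queries that together cover every non-NULL value
    of `column`: one below the first cut-off, one between each adjacent pair,
    and one from the last cut-off upward.

    Note: rows where `column` is NULL match none of these predicates.
    """
    cuts = sorted(cutoffs)
    if not cuts:
        return [base_query]
    queries = [f"{base_query} where {column} < '{cuts[0]}'"]
    for lo, hi in zip(cuts, cuts[1:]):
        queries.append(f"{base_query} where {column} >= '{lo}' and {column} < '{hi}'")
    queries.append(f"{base_query} where {column} >= '{cuts[-1]}'")
    return queries
```

The resulting list can then be passed straight to `cx.read_sql(url, queries)`. How evenly the work is spread depends on how evenly the rows are distributed between the chosen cut-offs.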

cufewxy Dec 21, 2021
Author

Thanks, it really works! When I split the SQL into 4 parts, it speeds up 4x.

I plan to wrap read_sql in a function that accepts a varchar column for partitioning. It first runs select distinct(partition_on_column) from (raw_sql) as a to build the list of SQL queries, then passes that list to cx.read_sql. I find that select distinct(partition_on_column) from (raw_sql) as a costs nearly no time; I guess the database optimizes this SQL in MSSQL. Maybe that's why select count(*) from (raw_sql) in cx costs no time..
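A possible shape for such a wrapper is sketched below. It is not the function that was eventually written; all names (`quote_literal`, `queries_from_distinct`, `read_sql_by_varchar`) are hypothetical, and grouping the distinct values into IN lists is just one of several reasonable choices.

```python
from typing import List, Optional, Sequence


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def queries_from_distinct(
    raw_sql: str, column: str, values: Sequence[Optional[str]], num_parts: int
) -> List[str]:
    """Spread the distinct values over up to num_parts queries using IN lists.
    A None entry stands for NULL and gets its own IS NULL query."""
    non_null = sorted(v for v in values if v is not None)
    num_parts = max(1, min(num_parts, len(non_null) or 1))
    queries = []
    for i in range(num_parts):
        chunk = non_null[i::num_parts]  # round-robin assignment of values
        if chunk:
            in_list = ", ".join(quote_literal(v) for v in chunk)
            queries.append(f"select * from ({raw_sql}) as a where {column} in ({in_list})")
    if any(v is None for v in values):
        queries.append(f"select * from ({raw_sql}) as a where {column} is null")
    return queries


def read_sql_by_varchar(url: str, raw_sql: str, column: str, num_parts: int = 4):
    """Hypothetical wrapper: look up the distinct values, then let connectorx
    run the partitioned queries and combine the results."""
    import connectorx as cx  # deferred so the pure helpers work without it

    col = cx.read_sql(url, f"select distinct({column}) from ({raw_sql}) as a")[column]
    values: List[Optional[str]] = [str(v) for v in col.dropna().tolist()]
    if col.isna().any():
        values.append(None)
    queries = queries_from_distinct(raw_sql, column, values, num_parts)
    # With no rows at all there is nothing to partition; run the query as-is.
    return cx.read_sql(url, queries if queries else raw_sql)
```

Round-robin assignment balances the number of distinct values per query, not the number of rows, so a heavily skewed column may still produce uneven partitions.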

wangxiaoying Dec 21, 2021
Maintainer

Awesome!

> I find that select distinct(partition_on_column) from (raw_sql) as a costs nearly no time; I guess the database optimizes this SQL in MSSQL. Maybe that's why select count(*) from (raw_sql) in cx costs no time..

Yeah, for some simple queries, maybe MSSQL can leverage the index or collected statistics to directly answer the distinct query. (Not so sure about more complex ones, but I guess the overhead should be similar to the count query when the number of distinct values is not so large.) Looking forward to your wrapped function!
