sql - Spark - searching spatial data - partition pruning


I have a large quantity of geotagged rows (hundreds of millions) that I need to query using Spark SQL, doing distance calculations between points. The SQL works OK using basic trigonometry, with a haversine distance function. The result set is restricted to rows whose latitude is within +/- some number of meters of the query point's latitude, and the same for longitude; it is then ordered by distance, taking the top N to find the closest points. So far so good. But the data is global, and keeping all the points in memory is inefficient.
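For illustration, a minimal Scala sketch of that query pattern, assuming a Parquet table with `lat`/`lon` columns in degrees; the query point, bounding-box deltas, and path are all hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("spatial-search").getOrCreate()
import spark.implicits._

// Haversine great-circle distance in kilometers (inputs in degrees).
val haversineKm = udf { (lat1: Double, lon1: Double, lat2: Double, lon2: Double) =>
  val r = 6371.0 // mean Earth radius, km
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
      math.pow(math.sin(dLon / 2), 2)
  2 * r * math.asin(math.sqrt(a))
}

// Hypothetical query point and bounding-box deltas in degrees.
val (qLat, qLon) = (40.7128, -74.0060)
val latDelta = 0.01 // roughly 1 km of latitude
val lonDelta = latDelta / math.cos(math.toRadians(qLat))

val points = spark.read.parquet("/data/points") // assumed lat/lon columns

val closest = points
  .filter($"lat".between(qLat - latDelta, qLat + latDelta) &&
          $"lon".between(qLon - lonDelta, qLon + lonDelta))
  .withColumn("dist_km", haversineKm($"lat", $"lon", lit(qLat), lit(qLon)))
  .orderBy($"dist_km")
  .limit(10)
```

The cheap bounding-box filter cuts the candidate set before the exact distance is computed, which is what makes the subsequent ordering affordable.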

My questions:

  1. What benefit would I realize from partition pruning if I partitioned the data by latitude ranges, with longitude subranges? That would reduce the search area to 1-3 latitude partitions and fewer than 10 longitude subpartitions, a lot less data; but I don't know whether the Spark SQL optimizer can prune partitions and subpartitions. It is also unclear whether partition pruning on a cached RDD is particularly beneficial. There is no join involved.

  2. I could partition the data using Parquet files, and subsequently read in only the Parquet partitions needed instead of all the data (sketched after this list). Is there another file format I should use that has partitioning capability?
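As a sketch of the layout question 2 describes, continuing from the snippet above: derive coarse partition keys and write them out with `partitionBy`. The 1-degree latitude bands, 10-degree longitude bands, and output path are hypothetical choices, not a prescription:

```scala
import org.apache.spark.sql.functions._

// Coarse partition keys: 1-degree latitude bands and
// 10-degree longitude bands (hypothetical bucket sizes).
val bucketed = points
  .withColumn("lat_band", floor($"lat").cast("int"))
  .withColumn("lon_band", (floor($"lon" / 10) * 10).cast("int"))

// Each distinct (lat_band, lon_band) pair becomes its own directory.
bucketed.write
  .partitionBy("lat_band", "lon_band")
  .parquet("/data/points_partitioned")
```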

You'll only benefit from partition pruning on the initial read. Spark SQL's optimizer will do the pruning for you if the data is in a columnar format (like Parquet), so you're not reading columns you don't need, much as other SQL databases do. And if you filter the data before caching it, you'll be querying and persisting a smaller subset of the data anyway. The optimizer takes the queries you pass in and does its best to read the minimal amount of data off disk.
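To check whether pruning is actually happening on the read, one can filter on the partition columns and inspect the physical plan; a sketch, assuming the hypothetical partitioned layout from above:

```scala
// Filters on partition columns let Spark skip whole directories on disk.
val nearby = spark.read.parquet("/data/points_partitioned")
  .filter($"lat_band".between(40, 42) && $"lon_band" === -80)

// The FileScan node of the plan should list PartitionFilters,
// confirming that only the matching partitions are read.
nearby.explain()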

ORC files are another file format you might want to look into. ORC tends to be smaller in size when sitting on HDFS, but can be slower when it comes to reading the data off of disk.
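ORC is used through the same DataFrame API; a minimal sketch, reusing the hypothetical bucketed layout from above:

```scala
// Same partitioning scheme, ORC instead of Parquet.
bucketed.write
  .partitionBy("lat_band", "lon_band")
  .orc("/data/points_orc")

val fromOrc = spark.read.orc("/data/points_orc")
```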

Finally, when you cache a DataFrame, Spark SQL uses a columnar compression format for persisting the data, so you should be able to fit more in memory than you might think. That also allows for efficient querying, since you don't have to read data that won't show up in your results.
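A sketch of the filter-then-cache pattern, reusing the hypothetical `haversineKm` UDF, query point, and partitioned layout from the earlier snippets:

```scala
// Prune to the region of interest first, then cache; Spark stores the
// cached DataFrame in a compressed in-memory columnar format.
val region = spark.read.parquet("/data/points_partitioned")
  .filter($"lat_band".between(40, 42))
  .cache()
region.count() // materializes the cache

// Repeated top-N distance queries now scan only the in-memory columns.
val top10 = region
  .withColumn("dist_km", haversineKm($"lat", $"lon", lit(qLat), lit(qLon)))
  .orderBy($"dist_km")
  .limit(10)
top10.show()
```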

