cloudera cdh - Ensure that Impala query gets materialized


Is there a reliable, efficient way to ensure that Impala query results get materialized without printing the results to the console? As an example, I will use an inner join query.

The obvious way to materialize query results is to create a table as select.

create table t3 stored as parquet as select t1.* from t1 inner join t2 on t1.id=t2.id;

The problem is that it writes to disk and is therefore inefficient. I'm looking for the most efficient way to execute a query and ensure that the results are materialized.

As an example, in Spark I can use the .cache method followed by .count to ensure that the query is materialized.

val t3 = t1.join(t2, "id")
t3.cache
t3.count

I could try a workaround with a sub-query.

select count(*) from (select t1.* from t1 inner join t2 on t1.id=t2.id) t3;

But I still need to ensure that the sub-query is materialized, and it is not obvious whether the query optimizer discovers that I'm only interested in the total count. Maybe there are hints to enforce it, or other tricks?

AFAIK you can't do that with Impala, and you were never able to.
Cloudera designed the tool to support BI tools such as Tableau, Qlik, MicroStrategy etc., not to support ad hoc ETL scripts.

On the other hand, Hive ships with "HPL/SQL", a procedural language wrapper that might fit your needs. Caveats:

  • requires Hive 2.0+
  • requires running the whole script inside the HPL/SQL interpreter, not the base Hive client (nor a standard JDBC connection)

The HPL/SQL tool also claims that it supports Impala queries, but I never investigated that claim. It might solve your problem, as a kind of clumsy workaround.

References:
  HIVE-11055 (PL/HQL tool contributed to the Hive code base)
  HPL/SQL website


Speaking of workarounds, why not use Spark, as you suggested yourself? It could read the Impala/Hive tables, either with the Spark native Parquet libraries or with a custom JDBC connection to an Impala daemon. In essence that is similar to the HPL/SQL solution.
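
A minimal sketch of the Spark route using the native Parquet reader might look like the following. It assumes Spark 2.x or later, and the object name and the two warehouse paths are hypothetical placeholders; substitute the real locations of the Parquet files behind t1 and t2. It reuses the join from the question, so unlike `select t1.*` the result also carries the columns of t2.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name; submit with spark-submit.
object MaterializeJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("materialize-join")
      .getOrCreate()

    // Assumed paths: point these at the Parquet files backing t1 and t2.
    val t1 = spark.read.parquet("/user/hive/warehouse/t1")
    val t2 = spark.read.parquet("/user/hive/warehouse/t2")

    // Inner join on the shared "id" column, as in the question.
    val t3 = t1.join(t2, "id")

    // cache() only marks the result for caching; it is lazy.
    t3.cache()

    // count() is an action, so it forces the join to run and fills the cache.
    val rows = t3.count()
    println(s"materialized $rows rows")

    spark.stop()
  }
}
```

Nothing is written back to a table here: the join result lives only in the Spark cache for the lifetime of the application.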

