Google BigQuery - Random sampling complete rows
I know from another question that random sampling can be done with RAND():
SELECT * FROM [table] WHERE RAND() < percentage
But this requires a full table scan and incurs the equivalent cost. I'm wondering if there are more efficient ways?
I'm experimenting with the tabledata.list API, but I get java.net.SocketTimeoutException: Read timed out when the index is very large (i.e. > 10000000). Is this operation not O(1)?
bigquery.tabledata().list(tableRef.getProjectId, tableRef.getDatasetId, tableRef.getTableId).setStartIndex(index).setMaxResults(1L).execute()
I'd recommend paging through tabledata.list with pageToken and collecting sample rows from each page. That should scale much better.
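The paging idea can be sketched roughly as follows. This is only an illustration, not the BigQuery client itself: `PagedSampler`, `Page`, and the `fetchPage` function are hypothetical names introduced here. In real code, `fetchPage` would presumably wrap a tabledata.list call that passes the token via `setPageToken` and reads the next token from the response.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Hypothetical helper: walks a paged result set and keeps a random fraction of rows.
final class PagedSampler {

    /** One page of rows plus the token of the next page (null on the last page). */
    static final class Page<R> {
        final List<R> rows;
        final String nextPageToken;

        Page(List<R> rows, String nextPageToken) {
            this.rows = rows;
            this.nextPageToken = nextPageToken;
        }
    }

    /**
     * fetchPage receives the page token (null for the first page) and returns that page.
     * Assumption: in a real client this would call tabledata.list with the given token.
     */
    static <R> List<R> sample(Function<String, Page<R>> fetchPage,
                              double fraction, Random rng) {
        List<R> sample = new ArrayList<>();
        String token = null;
        do {
            Page<R> page = fetchPage.apply(token);
            for (R row : page.rows) {
                // Keep each row independently with probability `fraction`.
                if (rng.nextDouble() < fraction) {
                    sample.add(row);
                }
            }
            token = page.nextPageToken;
        } while (token != null);
        return sample;
    }
}
```

Following page tokens avoids seeking to a large start index on every request, although this sketch still visits every page once.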
Another (totally different) option I see is the use of table decorators.
You can, in a loop, programmatically generate a random time (for a snapshot) or time frame (for a range) and query only those portions of the data, extracting the data you need.
Note the limitation: this only allows you to sample data less than 7 days old.
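As a rough sketch of the decorator approach, the helper below builds a random range decorator that stays inside the last 7 days. It assumes legacy SQL syntax of the form `[table@<start>-<end>]` with times in milliseconds since the epoch. `DecoratorSampler` and `randomRangeDecorator` are hypothetical names, and the table name used in the usage note is a placeholder.

```java
import java.util.Random;

// Hypothetical helper: builds a random range decorator within the 7-day limit.
final class DecoratorSampler {
    static final long SEVEN_DAYS_MS = 7L * 24 * 60 * 60 * 1000;

    /**
     * Returns e.g. "[mydataset.mytable@1700000000000-1700003600000]".
     * The range starts less than 7 days before nowMs and ends no later than nowMs.
     */
    static String randomRangeDecorator(String table, long nowMs,
                                       long windowMs, Random rng) {
        if (windowMs <= 0 || windowMs >= SEVEN_DAYS_MS) {
            throw new IllegalArgumentException("windowMs must be in (0, 7 days)");
        }
        // Number of valid start positions, so the whole window fits in the last 7 days.
        long span = SEVEN_DAYS_MS - windowMs;
        long offset = Math.floorMod(rng.nextLong(), span); // 0 <= offset < span
        long start = nowMs - SEVEN_DAYS_MS + 1 + offset;
        long end = start + windowMs;
        return "[" + table + "@" + start + "-" + end + "]";
    }
}
```

Calling this in a loop, for example `"SELECT * FROM " + DecoratorSampler.randomRangeDecorator("mydataset.mytable", System.currentTimeMillis(), 3_600_000L, new Random())`, would give one query per random one-hour slice. Each query then scans only the data covered by that slice.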