diff --git a/GeostatsPy_declustering_all_methods.ipynb b/GeostatsPy_declustering_all_methods.ipynb
index 3b89a67..6645063 100644
--- a/GeostatsPy_declustering_all_methods.ipynb
+++ b/GeostatsPy_declustering_all_methods.ipynb
@@ -7,21 +7,27 @@
 },
 "source": [
 "\n",
- "## Wide Array Declustering Distribution Representativity for Subsurface Data Analytics in Python \n",
+ "[notebook banner image]\n",
 "\n",
 "\n",
- "### Michael Pyrcz, Professor, The University of Texas at Austin \n",
- "\n",
- "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n"
+ "## GeostatsPy Well-documented Demonstration Workflows \n",
+ "\n",
+ "### Spatial Data Declustering for Representative Statistics\n",
+ "\n",
+ "#### Michael Pyrcz, Professor, University of Texas at Austin \n",
+ "\n",
+ "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "### Wide Array Declustering Data Distribution Representativity Plotting in Python with GeostatsPy\n",
+ "This is a tutorial / demonstration of **Spatial Data Declustering**. \n",
 "\n",
- "Here's a workflow with the wide array approacch for declustering, e.g., use and interpret a variety of declustering methods. This should help you get started data declustering to address spatial sampling bias. The methods demonstrated here include:\n",
+ "**YouTube Lecture**: check out my lecture on [Sources of Spatial Bias](https://youtu.be/w0HgVibxpMQ?si=8sOBf4sUIRqfvOc7) and [Spatial Data Declustering](https://youtu.be/rN0RKcTIVcI?si=6EOJvDKEplGtkXKD). The methods demonstrated here include:\n",
 "\n",
 "* cell-based declustering\n",
 "* polygonal declustering\n",
 "* kriging-based declustering\n",
@@ -69,25 +75,14 @@
 "\n",
 "In this demonstration we will take a biased spatial sample data set and apply declustering using **GeostatsPy** functionality.\n",
 "\n",
- "#### Getting Started\n",
- "\n",
- "Here's the steps to get setup in Python with the GeostatsPy package:\n",
- "\n",
- "1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n",
- "2. From Anaconda Navigator (within Anaconda3 group), go to the environment tab, click on base (root) green arrow and open a terminal. \n",
- "3. In the terminal type: pip install geostatspy. \n",
- "4. Open Jupyter and in the top block get started by copy and pasting the code block below from this Jupyter Notebook to start using the geostatspy functionality. \n",
- "\n",
- "You will need to copy the data file to your working directory. They are available here:\n",
- "\n",
- "* Tabular data - sample_data_biased.csv at https://git.io/fh0CW\n",
- "\n",
- "There are exampled below with these functions. You can go here to see a list of the available functions, https://git.io/fh4eX, other example workflows and source code. "
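+ "Each of these declustering methods calculates a weight for each sample, and representative statistics are then weighted statistics, e.g., the declustered mean,\n",
+ "\n",
+ "$\\overline{z}_{declus} = \\frac{\\sum_{i=1}^{n} w_i \\cdot z_i}{\\sum_{i=1}^{n} w_i}$\n",
+ "\n",
+ "Here's a minimal sketch of how the weights are applied, assuming the DataFrame df loaded below, with wts standing in for any of the weight arrays calculated in this workflow:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "naive_mean = np.mean(df['Porosity'])                   # equal weights, biased by the clustered samples\n",
+ "declus_mean = np.average(df['Porosity'], weights=wts)  # weighted by the declustering weights\n",
+ "```\n",
+ "\n",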
+ "#### Load the required libraries\n",
+ "\n",
+ "The following code loads the required libraries."
 ]
 },
 {
 "cell_type": "code",
- "execution_count": 9,
+ "execution_count": 1,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -104,7 +99,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 1,
+ "execution_count": 2,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -136,7 +131,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 2,
+ "execution_count": 3,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -157,7 +152,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 3,
+ "execution_count": 4,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -175,11 +170,10 @@
 },
 {
 "cell_type": "code",
- "execution_count": 4,
+ "execution_count": 5,
 "metadata": {},
 "outputs": [],
 "source": [
- "#df = pd.read_csv('sample_data_biased.csv') # load our data table (wrong name!)\n",
 "df = pd.read_csv(r'https://raw.githubusercontent.com/GeostatsGuy/GeoDataSets/master/sample_data_biased.csv')"
 ]
 },
@@ -194,7 +188,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 5,
+ "execution_count": 6,
 "metadata": {},
 "outputs": [
 {
@@ -250,115 +244,25 @@
 [HTML rendering of the table preview; the rows 3-12 deleted here match the text/plain output below]
 ],
 "text/plain": [
- "      X    Y  Facies  Porosity         Perm\n",
- "0   100  900       1  0.115359     5.736104\n",
- "1   100  800       1  0.136425    17.211462\n",
- "2   100  600       1  0.135810    43.724752\n",
- "3   100  500       0  0.094414     1.609942\n",
- "4   100  100       0  0.113049    10.886001\n",
- "5   200  800       1  0.154648   106.491795\n",
- "6   200  700       1  0.153113   140.976324\n",
- "7   200  500       1  0.126167    12.548074\n",
- "8   200  400       0  0.094750     1.208561\n",
- "9   200  100       1  0.150961    44.687430\n",
- "10  300  800       1  0.199227  1079.709291\n",
- "11  300  700       1  0.154220   179.491695\n",
- "12  300  500       1  0.137502    38.164911"
+ "     X    Y  Facies  Porosity       Perm\n",
+ "0  100  900       1  0.115359   5.736104\n",
+ "1  100  800       1  0.136425  17.211462\n",
+ "2  100  600       1  0.135810  43.724752"
 ]
 },
- "execution_count": 5,
+ "execution_count": 6,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
 "#print(df.iloc[0:5,:]) # display first 5 samples in the table as a preview\n",
- "df.head(n=13) # we could also use this command for a table preview"
+ "df.head(n=3) # we could also use this command for a table preview"
 ]
 },
@@ -374,7 +278,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 6,
+ "execution_count": 7,
 "metadata": {},
 "outputs": [
 {
@@ -484,7 +388,7 @@
 "Perm      71.454424  5308.842566 "
 ]
 },
- "execution_count": 6,
+ "execution_count": 7,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
@@ -504,7 +408,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 7,
+ "execution_count": 8,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -524,7 +428,7 @@
 },
 {
 "cell_type": "code",
"execution_count": 10, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -533,7 +437,7 @@ "" ] }, - "execution_count": 10, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -551,7 +455,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -882,345 +786,9 @@ "\n", "* calculate the weights as the normalized sum of the weights\n", "\n", - "Note, I extracted and modified the krige algorithm from GeostatsPy below to make a 2D kriging-based declustering method." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "def declus_kriging( \n", - " df,\n", - " xcol,\n", - " ycol,\n", - " vcol,\n", - " tmin,\n", - " tmax,\n", - " nx,\n", - " xmn,\n", - " xsiz,\n", - " ny,\n", - " ymn,\n", - " ysiz,\n", - " nxdis,\n", - " nydis,\n", - " ndmin,\n", - " ndmax,\n", - " radius,\n", - " ktype,\n", - " skmean,\n", - " vario,\n", - "):\n", - " \"\"\"GSLIB's KB2D program (Deutsch and Journel, 1998) converted from the\n", - " original Fortran to Python by Michael Pyrcz, the University of Texas at\n", - " Austin (Jan, 2019).\n", - " :param df: pandas DataFrame with the spatial data\n", - " :param xcol: name of the x coordinate column\n", - " :param ycol: name of the y coordinate column\n", - " :param vcol: name of the property column\n", - " :param tmin: property trimming limit\n", - " :param tmax: property trimming limit\n", - " :param nx: definition of the grid system (x axis)\n", - " :param xmn: definition of the grid system (x axis)\n", - " :param xsiz: definition of the grid system (x axis)\n", - " :param ny: definition of the grid system (y axis)\n", - " :param ymn: definition of the grid system (y axis)\n", - " :param ysiz: definition of the grid system (y axis)\n", - " :param nxdis: number of discretization points for a block\n", - " :param nydis: number of discretization points for a block\n", - " :param ndmin: minimum number of data points to use for kriging a block\n", - " :param ndmax: maximum number of data points to use for kriging a block\n", - " :param radius: maximum isotropic search radius\n", - " :param ktype:\n", - " :param skmean:\n", - " :param vario:\n", - " :return:\n", - " \"\"\"\n", - " \n", - "# Constants\n", - " UNEST = -999.\n", - " EPSLON = 1.0e-10\n", - " VERSION = 2.907\n", - " first = True\n", - " PMX = 9999.0 \n", - " MAXSAM = ndmax + 1\n", - " MAXDIS = nxdis * nydis\n", - " MAXKD = MAXSAM + 1\n", - " MAXKRG = MAXKD * MAXKD\n", - " \n", - "# load the variogram\n", - " nst = vario['nst']\n", - " cc = np.zeros(nst); aa = np.zeros(nst); it = np.zeros(nst)\n", - " ang = np.zeros(nst); anis = np.zeros(nst)\n", - " \n", - " c0 = vario['nug']; \n", - " cc[0] = vario['cc1']; it[0] = vario['it1']; ang[0] = vario['azi1']; \n", - " aa[0] = vario['hmaj1']; anis[0] = vario['hmin1']/vario['hmaj1'];\n", - " if nst == 2:\n", - " cc[1] = vario['cc2']; it[1] = vario['it2']; ang[1] = vario['azi2']; \n", - " aa[1] = vario['hmaj2']; anis[1] = vario['hmin2']/vario['hmaj2'];\n", - " \n", - "# Allocate the needed memory: \n", - " xdb = np.zeros(MAXDIS)\n", - " ydb = np.zeros(MAXDIS)\n", - " xa = np.zeros(MAXSAM)\n", - " ya = np.zeros(MAXSAM)\n", - " vra = np.zeros(MAXSAM)\n", - " dist = np.zeros(MAXSAM)\n", - " nums = np.zeros(MAXSAM)\n", - " r = np.zeros(MAXKD)\n", - " rr = np.zeros(MAXKD)\n", - " s = np.zeros(MAXKD)\n", - " a = np.zeros(MAXKRG)\n", - " kmap = np.zeros((nx,ny))\n", - " vmap = np.zeros((nx,ny))\n", - "\n", - "# Load the data\n", - " 
- "    df_extract = df.loc[(df[vcol] >= tmin) & (df[vcol] <= tmax)] # trim values outside tmin and tmax\n",
- "    nd = len(df_extract)\n",
- "    ndmax = min(ndmax,nd)\n",
- "    x = df_extract[xcol].values\n",
- "    y = df_extract[ycol].values\n",
- "    vr = df_extract[vcol].values\n",
- "\n",
- "# Make a KDTree for fast search of nearest neighbours\n",
- "    dp = list((y[i], x[i]) for i in range(0,nd))\n",
- "    data_locs = np.column_stack((y,x))\n",
- "    tree = sp.cKDTree(data_locs, leafsize=16, compact_nodes=True, copy_data=False, balanced_tree=True)\n",
- "\n",
- "# Summary statistics for the data after trimming\n",
- "    avg = vr.mean()\n",
- "    stdev = vr.std()\n",
- "    ss = stdev**2.0\n",
- "    vrmin = vr.min()\n",
- "    vrmax = vr.max()\n",
- "\n",
- "# Track the sum of weights for declustering\n",
- "    sum_wts = np.zeros(nd)\n",
- "\n",
- "# Set up the discretization points per block. Figure out how many\n",
- "# are needed, the spacing, and fill the xdb and ydb arrays with the\n",
- "# offsets relative to the block center (this only gets done once):\n",
- "    ndb = nxdis * nydis\n",
- "    if ndb > MAXDIS:\n",
- "        print('ERROR KB2D: Too many discretization points ')\n",
- "        print('            Increase MAXDIS or lower n[xy]dis')\n",
- "        return kmap\n",
- "    xdis = xsiz / max(float(nxdis),1.0)\n",
- "    ydis = ysiz / max(float(nydis),1.0)\n",
- "    xloc = -0.5*(xsiz+xdis)\n",
- "    i = -1 # accounting for 0 as lowest index\n",
- "    for ix in range(0,nxdis):\n",
- "        xloc = xloc + xdis\n",
- "        yloc = -0.5*(ysiz+ydis)\n",
- "        for iy in range(0,nydis):\n",
- "            yloc = yloc + ydis\n",
- "            i = i+1\n",
- "            xdb[i] = xloc\n",
- "            ydb[i] = yloc\n",
- "\n",
- "# Initialize accumulators:\n",
- "    cbb = 0.0\n",
- "    rad2 = radius*radius\n",
- "\n",
- "# Calculate Block Covariance. Check for point kriging.\n",
- "    rotmat, maxcov = geostats.setup_rotmat(c0,nst,it,cc,ang,PMX)\n",
- "    cov = geostats.cova2(xdb[0],ydb[0],xdb[0],ydb[0],nst,c0,PMX,cc,aa,it,ang,anis,rotmat,maxcov)\n",
- "# Keep this value to use for the unbiasedness constraint:\n",
- "    unbias = cov\n",
- "    first = False\n",
- "    if ndb <= 1:\n",
- "        cbb = cov\n",
- "    else:\n",
- "        for i in range(0,ndb):\n",
- "            for j in range(0,ndb):\n",
- "                cov = geostats.cova2(xdb[i],ydb[i],xdb[j],ydb[j],nst,c0,PMX,cc,aa,it,ang,anis,rotmat,maxcov)\n",
- "                if i == j:\n",
- "                    cov = cov - c0\n",
- "                cbb = cbb + cov\n",
- "        cbb = cbb/float(ndb*ndb)\n",
- "\n",
- "# MAIN LOOP OVER ALL THE BLOCKS IN THE GRID:\n",
- "    nk = 0\n",
- "    ak = 0.0\n",
- "    vk = 0.0\n",
- "    for iy in range(0,ny):\n",
- "        yloc = ymn + (iy-0)*ysiz\n",
- "        for ix in range(0,nx):\n",
- "            xloc = xmn + (ix-0)*xsiz\n",
- "            current_node = (yloc,xloc)\n",
- "\n",
- "# Find the nearest samples within the search radius:\n",
- "            na = -1 # accounting for 0 as first index\n",
- "            dist.fill(1.0e+20)\n",
- "            nums.fill(-1)\n",
- "            dist, nums = tree.query(current_node,ndmax) # use kd tree for fast nearest data search\n",
- "            # remove any data outside search radius\n",
- "            nums = nums[dist < radius]\n",
- "            dist = dist[dist < radius]\n",
- "            na = len(dist)\n",
- "\n",
- "# Is there enough samples?\n",
- "            if na < ndmin:\n",
- "                est = UNEST\n",
- "                estv = UNEST\n",
- "                print('UNEST at ' + str(ix) + ',' + str(iy))\n",
- "            else:\n",
- "\n",
- "# Put coordinates and values of neighborhood samples into xa,ya,vra:\n",
- "                for ia in range(0,na):\n",
- "                    jj = int(nums[ia])\n",
- "                    xa[ia] = x[jj]\n",
- "                    ya[ia] = y[jj]\n",
- "                    vra[ia] = vr[jj]\n",
- "\n",
- "# Solve the kriging system - data-to-data covariances on the left hand\n",
- "# side, data-to-block covariances (averaged over the discretization\n",
- "# points) on the right, plus the unbiasedness constraint for ordinary\n",
- "# kriging (ktype = 1):\n",
- "                neq = na + ktype\n",
- "                a = np.zeros((neq,neq))\n",
- "                r = np.zeros(neq)\n",
- "                for i in range(0,na):\n",
- "                    for j in range(0,na):\n",
- "                        a[i,j] = geostats.cova2(xa[i],ya[i],xa[j],ya[j],nst,c0,PMX,cc,aa,it,ang,anis,rotmat,maxcov)\n",
- "                    cb = 0.0\n",
- "                    for j in range(0,ndb):\n",
- "                        cb = cb + geostats.cova2(xa[i],ya[i],xloc+xdb[j],yloc+ydb[j],nst,c0,PMX,cc,aa,it,ang,anis,rotmat,maxcov)\n",
- "                    r[i] = cb / float(ndb)\n",
- "                if ktype == 1:\n",
- "                    a[0:na,na] = unbias\n",
- "                    a[na,0:na] = unbias\n",
- "                    r[na] = unbias\n",
- "                s = np.linalg.solve(a,r)\n",
- "\n",
- "# Compute the estimate and kriging variance and accumulate the kriging\n",
- "# weight assigned to each datum - the declustering modification:\n",
- "                est = 0.0\n",
- "                estv = cbb\n",
- "                sumw = 0.0\n",
- "                if ktype == 1:\n",
- "                    estv = estv - s[na]*unbias\n",
- "                for i in range(0,na):\n",
- "                    sumw = sumw + s[i]\n",
- "                    est = est + s[i]*vra[i]\n",
- "                    estv = estv - s[i]*r[i]\n",
- "                    sum_wts[nums[i]] = sum_wts[nums[i]] + s[i]\n",
- "                if ktype == 0:\n",
- "                    est = est + (1.0-sumw)*skmean\n",
- "            kmap[ny-iy-1,ix] = est\n",
- "            vmap[ny-iy-1,ix] = estv\n",
- "            if est > UNEST:\n",
- "                nk = nk + 1\n",
- "                ak = ak + est\n",
- "                vk = vk + est*est\n",
- "\n",
- "# END OF MAIN LOOP OVER ALL THE BLOCKS:\n",
- "\n",
- "    if nk >= 1:\n",
- "        ak = ak / float(nk)\n",
- "        vk = vk/float(nk) - ak*ak\n",
- "        print(' Estimated ' + str(nk) + ' blocks ')\n",
- "        print(' average ' + str(ak) + ' variance ' + str(vk))\n",
- "\n",
- "# standardize the kriging weights\n",
- "    sum_sum_wts = np.sum(sum_wts)\n",
- "    if sum_sum_wts <= 0.0:\n",
- "        sum_wts = np.ones(nd)\n",
- "    else:\n",
- "        sum_wts = sum_wts/sum_sum_wts*nd\n",
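+ "Here's a minimal, self-contained sketch of the idea, using simple kriging over all the data with a hypothetical covariance function cova for clarity - the full implementation with search and variogram handling is the declus_kriging function applied below:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "def kriging_declus_sketch(x, y, nodes, cova):\n",
+ "    # x, y - arrays of the data coordinates; nodes - iterable of (x,y) grid node locations\n",
+ "    # cova - covariance between two locations, e.g. from a variogram model\n",
+ "    nd = len(x)\n",
+ "    C = np.array([[cova(x[i],y[i],x[j],y[j]) for j in range(nd)] for i in range(nd)])\n",
+ "    sum_wts = np.zeros(nd)\n",
+ "    for xn, yn in nodes:                    # krige at each node and accumulate the weights\n",
+ "        c = np.array([cova(x[i],y[i],xn,yn) for i in range(nd)])\n",
+ "        sum_wts += np.linalg.solve(C, c)    # simple kriging weights at this node\n",
+ "    return sum_wts / np.sum(sum_wts) * nd   # standardize the weights to average to 1.0\n",
+ "```\n",
+ "\n",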
- "\n",
- "    return sum_wts"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's assume a reasoanble grid for kriging\n",
+ "Note, I extracted and modified the krige algorithm from GeostatsPy to make declus_kriging, a 2D kriging-based declustering method.\n",
+ "\n",
+ "Let's assume a reasonable grid for kriging\n",
 "\n",
 "* sufficient resolution to capture the spatial context of the data\n",
 "\n",
@@ -1280,12 +848,12 @@
 "output_type": "stream",
 "text": [
 " Estimated 10000 blocks \n",
- " average 0.08916482425663337 variance 0.001716118320222105\n"
+ " average 0.08916482425663338 variance 0.001716118320222105\n"
 ]
 }
 ],
 "source": [
- "wts_kriging = declus_kriging(df,'X','Y','Porosity',tmin,tmax,nx,xmn,xsiz,ny,ymn,ysiz,nxdis,nydis,\n",
+ "wts_kriging = geostats.declus_kriging(df,'X','Y','Porosity',tmin,tmax,nx,xmn,xsiz,ny,ymn,ysiz,nxdis,nydis,\n",
 "        ndmin,ndmax,radius,ktype,skmean,vario)"
 ]
 },
@@ -1426,7 +994,7 @@
 "Correction of 0.0742.\n",
 "\n",
 "Summary statistics of the declustering weights:\n",
- "DescribeResult(nobs=289, minmax=(-0.11154380045658596, 2.4301148927963636), mean=1.0, variance=0.3317473129698282, skewness=0.05096624432935332, kurtosis=-0.8372026953833722)\n"
+ "DescribeResult(nobs=289, minmax=(-0.11154380045658625, 2.4301148927963636), mean=1.0, variance=0.3317473129698282, skewness=0.050966244329353386, kurtosis=-0.8372026953833727)\n"
 ]
 },
 {
@@ -1484,113 +1052,21 @@
 "\n",
 "### Polygonal Declustering\n",
 "\n",
- "Let's take the standard 2D kriging method and modify it\n",
+ "Let's calculate the polygons with a discrete approximation on a regular grid:\n",
 "\n",
 "* assign the nearest data value to each cell\n",
 "\n",
 "* calculate the weights as the normalized sum of the weights (number of cells assigned to each datum)\n",
 "\n",
- "Note, I extracted and modified the krige algorithm from GeostatsPy below to make a 2D kriging-based declustering method."
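+ "Here's a minimal sketch of this nearest-neighbor cell assignment, assuming data coordinate arrays x and y and a hypothetical array of grid cell center locations cells:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "from scipy.spatial import cKDTree\n",
+ "\n",
+ "tree = cKDTree(np.column_stack((x, y)))           # fast nearest-datum search\n",
+ "_, nearest = tree.query(cells)                    # index of the nearest datum for each cell\n",
+ "counts = np.bincount(nearest, minlength=len(x))   # number of cells assigned to each datum\n",
+ "wts = counts / counts.sum() * len(x)              # standardize the weights to average to 1.0\n",
+ "```\n",
+ "\n",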
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {},
- "outputs": [],
- "source": [
- "def polygonal_declus( \n",
- "    df,\n",
- "    xcol,\n",
- "    ycol,\n",
- "    vcol,\n",
- "    tmin,\n",
- "    tmax,\n",
- "    nx,\n",
- "    xmn,\n",
- "    xsiz,\n",
- "    ny,\n",
- "    ymn,\n",
- "    ysiz,\n",
- "):\n",
- "    \"\"\"polygonal declustering Michael Pyrcz, The University of Texas at\n",
- "    Austin (Jan, 2019).\n",
- "    :param df: pandas DataFrame with the spatial data\n",
- "    :param xcol: name of the x coordinate column\n",
- "    :param ycol: name of the y coordinate column\n",
- "    :param vcol: name of the property column\n",
- "    :param tmin: property trimming limit\n",
- "    :param tmax: property trimming limit\n",
- "    :param nx: definition of the grid system (x axis)\n",
- "    :param xmn: definition of the grid system (x axis)\n",
- "    :param xsiz: definition of the grid system (x axis)\n",
- "    :param ny: definition of the grid system (y axis)\n",
- "    :param ymn: definition of the grid system (y axis)\n",
- "    :param ysiz: definition of the grid system (y axis)\n",
- "    :return: array of declustering weights and grid of polygon indices\n",
- "    \"\"\"\n",
- "\n",
- "# Allocate the needed memory:\n",
- "    grid = np.zeros((ny,nx))\n",
- "\n",
- "# Load the data\n",
- "    df_extract = df.loc[(df[vcol] >= tmin) & (df[vcol] <= tmax)] # trim values outside tmin and tmax\n",
- "    nd = len(df_extract)\n",
- "    x = df_extract[xcol].values\n",
- "    y = df_extract[ycol].values\n",
- "    vr = df_extract[vcol].values\n",
- "\n",
- "# Make a KDTree for fast search of nearest neighbours\n",
- "    dp = list((y[i], x[i]) for i in range(0,nd))\n",
- "    data_locs = np.column_stack((y,x))\n",
- "    tree = sp.cKDTree(data_locs, leafsize=16, compact_nodes=True, copy_data=False, balanced_tree=True)\n",
- "\n",
- "# Summary statistics for the data after trimming\n",
- "    avg = vr.mean()\n",
- "    stdev = vr.std()\n",
- "    ss = stdev**2.0\n",
- "    vrmin = vr.min()\n",
- "    vrmax = vr.max()\n",
- "\n",
- "# Track the sum of weights for declustering\n",
- "    sum_wts = np.zeros(nd)\n",
- "\n",
- "# MAIN LOOP OVER ALL THE BLOCKS IN THE GRID:\n",
- "    nk = 0\n",
- "    ak = 0.0\n",
- "    vk = 0.0\n",
- "    for iy in range(0,ny):\n",
- "        yloc = ymn + (iy-0)*ysiz\n",
- "        for ix in range(0,nx):\n",
- "            xloc = xmn + (ix-0)*xsiz\n",
- "            current_node = (yloc,xloc)\n",
- "\n",
- "# Find the nearest sample and assign the cell to it:\n",
- "            dist, num = tree.query(current_node,1) # use kd tree for fast nearest data search\n",
- "            grid[ny-iy-1,ix] = num\n",
- "            sum_wts[num] = sum_wts[num] + 1\n",
- "\n",
- "# END OF MAIN LOOP OVER ALL THE BLOCKS:\n",
- "\n",
- "# standardize the declustering weights\n",
- "    sum_sum_wts = np.sum(sum_wts)\n",
- "    if sum_sum_wts <= 0.0:\n",
- "        sum_wts = np.ones(nd)\n",
- "    else:\n",
- "        sum_wts = sum_wts/sum_sum_wts*nd\n",
- "\n",
- "    return sum_wts,grid"
+ "Let's refine the grid for greater accuracy (with some increased run time; complexity is $O(n^2)$), and then run to calculate the polygonal weights and check the model by looking at the polygons and the data.\n",
+ "\n",
+ "* WARNING: This may take 1-2 minutes to run."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
- "source": [
- "Let's refine the grid for greater accuracy (with some increased run time, complexity is $O(n^2)$, and then run to calculate the polygonal weights and check the model by looking at the polygons and the data.\n",
- "\n",
- "* WARNING: This may take 1-2 minutes to run."
- ]
+ "source": []
 },
 {
 "cell_type": "code",
@@ -1613,7 +1089,7 @@
 "pnx = nx * factor; pny = ny * factor; pxsiz = xsiz / factor; pysiz = ysiz / factor\n",
 "pxmn = pxsiz / 2; pymn = pysiz / 2\n",
 "\n",
- "wts_polygonal, polygons = polygonal_declus(df,'X','Y','Porosity',tmin,tmax,pnx,pxmn,pxsiz,pny,pymn,pysiz)\n",
+ "wts_polygonal, polygons = geostats.polygonal_declus(df,'X','Y','Porosity',tmin,tmax,pnx,pxmn,pxsiz,pny,pymn,pysiz)\n",
 "\n",
 "plt.subplot(111)\n",
 "cs = plt.imshow(polygons,interpolation = None,extent = [xmin,xmax,ymin,ymax], vmin = 0, vmax = len(df),cmap = plt.cm.inferno)\n",