Bienvenue sur PostGIS.fr

Bienvenue sur PostGIS.fr , le site de la communauté des utilisateurs francophones de PostGIS.

PostGIS ajoute le support d'objets géographique à la base de données PostgreSQL. En effet, PostGIS "spatialise" le serverur PostgreSQL, ce qui permet de l'utiliser comme une base de données SIG.

Maintenu à jour, en fonction de nos disponibilités et des diverses sorties des outils que nous testons, nous vous proposons l'ensemble de nos travaux publiés en langue française.

source: trunk/workshop-foss4g/joins_advanced.rst @ 19

Revision 1, 7.8 KB checked in by djay, 13 years ago (diff)

Initial import of the svn tree

RevLine 
[1]1.. _joins_advanced:
2
3Section 19: More Spatial Joins
4==============================
5
6In the last section we saw the :command:`ST_Centroid(geometry)` and :command:`ST_Union([geometry])` functions, and some simple examples. In this section we will do some more elaborate things with them.
7
8.. _creatingtractstable:
9
10Creating a Census Tracts Table
11------------------------------
12
13In the workshop ``\data\`` directory, is a file that includes attribute data, but no geometry, ``nyc_census_sociodata.sql``. The table includes interesting socioeconomic data about New York: commute times, incomes, and education attainment. There is just one problem. The data are summarized by "census tract" and we have no census tract spatial data!
14
15In this section we will
16
17 * Load the ``nyc_census_sociodata.sql`` table
18 * Create a spatial table for census tracts
19 * Join the attribute data to the spatial data
20 * Carry out some analysis using our new data
21 
22Loading nyc_census_sociodata.sql
23~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
24
25 #. Open the SQL query window in PgAdmin
26 #. Select **File->Open** from the menu and browse to the ``nyc_census_sociodata.sql`` file
27 #. Press the "Run Query" button
28 #. If you press the "Refresh" button in PgAdmin, the list of tables should now include at ``nyc_census_sociodata`` table
29 
30Creating a Census Tracts Table
31~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
32 
33As we saw in the previous section, we can build up higher level geometries from the census block by summarizing on substrings of the ``blkid`` key. In order to get census tracts, we need to summarize grouping on the first 11 characters of the ``blkid``.
34 
35  ::
36
37    360610001009000 = 36 061 00100 9000
38
39    36     = State of New York
40    061    = New York County (Manhattan)
41    000100 = Census Tract
42    9      = Census Block Group
43    000    = Census Block
44
45Create the new table using the :command:`ST_Union` aggregate:
46 
47.. code-block:: sql
48   
49   -- Make the tracts table
50   CREATE TABLE nyc_census_tract_geoms AS
51   SELECT
52     ST_Union(the_geom) AS the_geom,
53     SubStr(blkid,1,11) AS tractid
54   FROM nyc_census_blocks
55   GROUP BY tractid;
56     
57   -- Index the tractid
58   CREATE INDEX nyc_census_tract_geoms_tractid_idx ON nyc_census_tract_geoms (tractid);
59     
60   -- Update the geometry_columns table
61   SELECT Populate_Geometry_Columns();
62
63Join the Attributes to the Spatial Data
64~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
65
66Join the table of tract geometries to the table of tract attributes with a standard attribute join
67 
68.. code-block:: sql
69 
70  -- Make the tracts table
71  CREATE TABLE nyc_census_tracts AS
72  SELECT
73    g.the_geom,
74    a.*
75  FROM nyc_census_tract_geoms g
76  JOIN nyc_census_sociodata a
77  ON g.tractid = a.tractid;
78   
79  -- Index the geometries
80  CREATE INDEX nyc_census_tract_gidx ON nyc_census_tracts USING GIST (the_geom);
81   
82  -- Update the geometry_columns table
83  SELECT Populate_Geometry_Columns();
84
85.. _interestingquestion:
86
87Answer an Interesting Question
88~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
89     
90Answer an interesting question! "List top 10 New York neighborhoods ordered by the proportion of people who have graduate degrees."
91 
92.. code-block:: sql
93 
94  SELECT
95    Round(100.0 * Sum(t.edu_graduate_dipl) / Sum(t.edu_total), 1) AS graduate_pct,
96    n.name, n.boroname
97  FROM nyc_neighborhoods n
98  JOIN nyc_census_tracts t
99  ON ST_Intersects(n.the_geom, t.the_geom)
100  WHERE t.edu_total > 0
101  GROUP BY n.name, n.boroname
102  ORDER BY graduate_pct DESC
103  LIMIT 10;
104
105We sum up the statistics we are interested, then divide them together at the end. In order to avoid divide-by-zero errors, we don't bother bringing in tracts that have a population count of zero.
106
107::
108 
109   graduate_pct |       name        | boroname 
110  --------------+-------------------+-----------
111           40.4 | Carnegie Hill     | Manhattan
112           40.2 | Flatbush          | Brooklyn
113           34.8 | Battery Park      | Manhattan
114           33.9 | North Sutton Area | Manhattan
115           33.4 | Upper West Side   | Manhattan
116           33.3 | Upper East Side   | Manhattan
117           32.0 | Tribeca           | Manhattan
118           31.8 | Greenwich Village | Manhattan
119           29.8 | West Village      | Manhattan
120           29.7 | Central Park      | Manhattan
121   
122 
123.. _polypolyjoins:
124
125Polygon/Polygon Joins
126---------------------
127
128In our interesting query (in :ref:`interestingquestion`) we used the :command:`ST_Intersects(geometry_a, geometry_b)` function to determine which census tract polygons to include in each neighborhood summary. Which leads to the question: what if a tract falls on the border between two neighborhoods? It will intersect both, and so will be included in the summary statistics for **both**.
129
130.. image:: ./screenshots/centroid_neighborhood.png
131
132To avoid this kind of double counting there are two methods:
133
134 * The simple method is to ensure that each tract only falls in **one** summary area (using :command:`ST_Centroid(geometry)`)
135 * The complex method is to divide crossing tracts at the borders (using :command:`ST_Intersection(geometry,geometry)`)
136 
137Here is an example of using the simple method to avoid double counting in our graduate education query:
138
139.. code-block:: sql
140
141  SELECT
142    Round(100.0 * Sum(t.edu_graduate_dipl) / Sum(t.edu_total), 1) AS graduate_pct,
143    n.name, n.boroname
144  FROM nyc_neighborhoods n
145  JOIN nyc_census_tracts t
146  ON ST_Contains(n.the_geom, ST_Centroid(t.the_geom))
147  WHERE t.edu_total > 0
148  GROUP BY n.name, n.boroname
149  ORDER BY graduate_pct DESC
150  LIMIT 10;
151 
152Note that the query takes longer to run now, because the :command:`ST_Centroid` function  has to be run on every census tract.
153
154::
155
156   graduate_pct |       name        | boroname 
157  --------------+-------------------+-----------
158           49.2 | Carnegie Hill     | Manhattan
159           39.5 | Battery Park      | Manhattan
160           34.3 | Upper East Side   | Manhattan
161           33.6 | Upper West Side   | Manhattan
162           32.5 | Greenwich Village | Manhattan
163           32.2 | Tribeca           | Manhattan
164           31.3 | North Sutton Area | Manhattan
165           30.8 | West Village      | Manhattan
166           30.1 | Downtown          | Brooklyn
167           28.4 | Cobble Hill       | Brooklyn
168 
169Avoiding double counting changes the results!
170
171
172.. _largeradiusjoins:
173
174Large Radius Distance Joins
175---------------------------
176
177A query that is fun to ask is "How do the commute times of people near (within 500 meters) subway stations differ from those of people far away from subway stations?"
178
179However, the question runs into some problems of double counting: many people will be within 500 meters of multiple subway stations. Compare the population of New York:
180
181.. code-block:: sql
182
183  SELECT Sum(popn_total)
184  FROM nyc_census_blocks;
185 
186::
187
188  8008278
189 
190With the population of the people in New York within 500 meters of a subway station:
191
192.. code-block:: sql
193
194  SELECT Sum(popn_total)
195  FROM nyc_census_blocks census
196  JOIN nyc_subway_stations subway
197  ON ST_DWithin(census.the_geom, subway.the_geom, 500);
198 
199::
200
201  10556898
202
203There's more people close to the subway than there are people! Clearly, our simple SQL is making a big double-counting error. You can see the problem looking at the picture of the buffered subways.
204
205.. image:: ./screenshots/subways_buffered.png
206
207The solution is to ensure that we have only distinct census blocks before passing them into the summarization portion of the query. We can do that by breaking our query up into a subquery that finds the distinct blocks, wrapped in a summarization query that returns our answer:
208
209.. code-block:: sql
210
211  SELECT Sum(popn_total)
212  FROM (
213    SELECT DISTINCT ON (blkid) popn_total
214    FROM nyc_census_blocks census
215    JOIN nyc_subway_stations subway
216    ON ST_DWithin(census.the_geom, subway.the_geom, 500)
217  ) AS distinct_blocks;
218 
219::
220
221  4953599
222
223That's better! So a bit over half the population of New York is within 500m (about a 5-7 minute walk) of the subway.
224
225
226
Note: See TracBrowser for help on using the repository browser.