{
  "version": "https://jsonfeed.org/version/1",
  "title": "Ian's Digital Garden",
  "home_page_url": "https://ianwwagner.com/",
  "feed_url": "https://ianwwagner.com//tag-gis.json",
  "description": "",
  "items": [
    {
      "id": "https://ianwwagner.com//how-and-why-to-work-with-arrow-and-duckdb-in-rust.html",
      "url": "https://ianwwagner.com//how-and-why-to-work-with-arrow-and-duckdb-in-rust.html",
      "title": "How (and why) to work with Arrow and DuckDB in Rust",
      "content_html": "<p>My day job involves wrangling a lot of data very fast.\nI've heard a lot of people raving about several technologies like DuckDB,\n(Geo)Parquet, and Apache Arrow recently.\nBut despite being an &quot;early adopter,&quot;\nit took me quite a while to figure out how and why to leverage these practically.</p>\n<p>Last week, a few things &quot;clicked&quot; for me, so I'd like to share what I learned in case it helps you.</p>\n<h1><a href=\"#geoparquet\" aria-hidden=\"true\" class=\"anchor\" id=\"geoparquet\"></a>(Geo)Parquet</h1>\n<p>(Geo)Parquet is quite possibly the best understood tech in the mix.\nIt is not exactly new.\nParquet has been around for quite a while in the big data ecosystem.\nIf you need a refresher, the <a href=\"https://guide.cloudnativegeo.org/geoparquet/\">Cloud-optimized Geospatial Formats Guide</a>\ngives a great high-level overview.</p>\n<p>Here are the stand-out features:</p>\n<ul>\n<li>It has a schema and some data types, unlike CSV (you can even have maps and lists!).</li>\n<li>On disk, values are written in groups per <em>column</em>, rather than writing one row at a time.\nThis makes the data much easier to compress, and lets readers easily skip over data they don't need.</li>\n<li>Statistics at several levels which enable &quot;predicate pushdown.&quot; Even though the files are columnar in nature,\nyou can narrow which files and &quot;row groups&quot; within each file have the data you need!</li>\n</ul>\n<p>Practically speaking, parquet lets you distribute large datasets in <em>one or more</em> files\nwhich will be significantly <em>smaller and faster to query</em> than other familiar formats.</p>\n<h2><a href=\"#why-you-should-care\" aria-hidden=\"true\" class=\"anchor\" id=\"why-you-should-care\"></a>Why you should care</h2>\n<p>The value proposition is clear for big data processing.\nIf you're trying to get a record of all traffic accidents in California,\nor find the hottest restaurants in Paris based 
on a multi-terabyte dataset,\nparquet provides clear advantages.\nYou can skip row groups within the parquet file or even whole files\nto narrow your search!\nAnd since datasets can be split across files,\nyou can keep adding to the dataset over time, parallelize queries,\nand other nice things.</p>\n<p>But what if you're not doing these high-level analytical things?\nWhy not just use a more straightforward format like CSV\nthat avoids the need to &quot;rotate&quot; back into rows\nfor non-aggregation use cases?\nHere are a few reasons to like Parquet:</p>\n<ul>\n<li>You actually have a schema! This means less format shifting and validation in your code.</li>\n<li>Operating on row groups turns out to be pretty efficient, even when you're reading the whole dataset.\nCombining batch reads with compression, your processing code will usually get faster.</li>\n<li>It's designed to be readable from object storage.\nThis means you can often process massive datasets from your laptop.\nParquet readers are smart and can skip over data you don't need.\nYou can't do this with CSV.</li>\n</ul>\n<p>The upshot of all this is that it generally gets both <em>easier</em> and <em>faster</em>\nto work with your data...\nprovided that you have the right tools to leverage it.</p>\n<h1><a href=\"#duckdb\" aria-hidden=\"true\" class=\"anchor\" id=\"duckdb\"></a>DuckDB</h1>\n<p>DuckDB describes itself as an in-process, portable, feature-rich, and fast database\nfor analytical workloads.\nDuckDB was that tool that triggered my &quot;lightbulb moment&quot; last week.\nFoursquare, an app which I've used for a decade or more,\nrecently released an <a href=\"https://location.foursquare.com/resources/blog/products/foursquare-open-source-places-a-new-foundational-dataset-for-the-geospatial-community/\">open data set</a>,\nwhich was pretty cool!\nIt was also in Parquet format (just like <a href=\"https://overturemaps.org/\">Overture</a>'s data sets).</p>\n<p>You can't just open up a Parquet file 
in a text editor or spreadsheet software like you can a CSV.\nMy friend Oliver released a <a href=\"https://wipfli.github.io/foursquare-os-places-pmtiles/\">web-based demo</a>\na few weeks ago which lets you inspect the data on a map at the point level.\nBut to do more than spot checking, you'll probably want a database that can work with Parquet.\nAnd that's where DuckDB comes in.</p>\n<h2><a href=\"#why-you-should-care-1\" aria-hidden=\"true\" class=\"anchor\" id=\"why-you-should-care-1\"></a>Why you should care</h2>\n<h3><a href=\"#its-embedded\" aria-hidden=\"true\" class=\"anchor\" id=\"its-embedded\"></a>It's embedded</h3>\n<p>I understood the in-process part of DuckDB's value proposition right away.\nIt's similar to SQLite, where you don't have to go through a server\nor over an HTTP connection.\nThis is both simpler to reason about and <a href=\"quadrupling-the-performance-of-a-data-pipeline.html\">is usually quite a bit faster</a>\nthan having to call out to a separate service!</p>\n<p>DuckDB is pretty quick to compile from source.\nYou probably don't need to muck around with this if you're just using the CLI,\nbut I wanted to eventually use it embedded in some Rust code.\nCompiling from source turned out to be the easiest way to get their crate working.\nIt looks for a shared library by default, but I couldn't get this working after a <code>brew</code> install.\nThis was mildly annoying, but on the other hand,\nvendoring the library does make consistent Docker builds easier 🤷🏻‍♂️</p>\n<h3><a href=\"#features-galore\" aria-hidden=\"true\" class=\"anchor\" id=\"features-galore\"></a>Features galore!</h3>\n<p>DuckDB includes a mind boggling number of features.\nNot in a confusing way; more in a Python stdlib way where just about everything you'd want is already there.\nYou can query a whole directory (or bucket) of CSV files,\na Postgres database, SQLite, or even an OpenStreetMap PBF file 🤯\nYou can even write a SQL query against a glob expression of Parquet 
files in S3\nas your &quot;table.&quot;\n<strong>That's really cool!</strong>\n(If you've been around the space, you may recognize this concept from\nAWS Athena and others.)</p>\n<h3><a href=\"#speed\" aria-hidden=\"true\" class=\"anchor\" id=\"speed\"></a>Speed</h3>\n<p>Writing a query against a local directory of files is actually really fast!\nIt does a bit of munging upfront, and yes,\nit's not quite as fast as if you'd prepped the data into a clean table,\nbut you actually can run quite efficient queries this way locally!</p>\n<p>When running a query against local data,\nDuckDB will make liberal use of your system memory\n(the default is 80% of system RAM)\nand as many CPUs as you can throw at it.\nBut it will reward you with excellent response times,\ncourtesy of the &quot;vectorized&quot; query engine.\nWhat I've heard of the design reminds me of how array-oriented programming languages like APL\n(or less esoteric libraries like numpy) are often implemented.</p>\n<p>I was able to do some spatial aggregation operations\n(bucketing a filtered list of locations by H3 index)\nin about <strong>10 seconds on a dataset of more than 40 million rows</strong>!\n(The full dataset is over 100 million rows, so I also got to see the selective reading in action.)\nThat piqued my interest, to say the least.\n(Here's the result of that query, visualized).</p>\n<p><figure><img src=\"media/foursquare-os-places-density-2024.png\" alt=\"A map of the world showing heavy density in the US, southern Canada, central Mexico, parts of coastal South America, Europe, Korea, Japan, parts of SE Asia, and Australia\" /></figure></p>\n<h3><a href=\"#that-analytical-thing\" aria-hidden=\"true\" class=\"anchor\" id=\"that-analytical-thing\"></a>That analytical thing...</h3>\n<p>And now for the final buzzword in DuckDB's marketing: analytical.\nDuckDB frequently describes itself as optimized for OLAP (OnLine Analytical Processing) workloads.\nThis is contrasted with OLTP (OnLine Transaction 
Processing).\n<a href=\"https://en.wikipedia.org/wiki/Online_analytical_processing\">Wikipedia</a> will tell you some differences\nin a lot of sweepingly broad terms, like being used for &quot;business reporting&quot; and read operations\nrather than &quot;transactions.&quot;</p>\n<p>When reaching for a definition, many sources focus on things like <em>aggregation</em> queries\nas a differentiator.\nThis didn't help, since most of my use cases involve slurping most or all of the data set.\nThe DuckDB marketing and docs didn't help clarify things either.</p>\n<p>Let me know on Mastodon if you have a better explanation of what an &quot;analytical&quot; database is 🤣</p>\n<p>I think a better explanation is probably 1) you do mostly <em>read</em> queries,\nand 2) it can execute highly parallel queries.\nSo far, DuckDB has been excellent for both the &quot;aggregate&quot; and the &quot;iterative&quot; use case.\nI assume it's just not the best choice per se if your workload is a lot of single-record writes?</p>\n<h2><a href=\"#how-im-using-duckdb\" aria-hidden=\"true\" class=\"anchor\" id=\"how-im-using-duckdb\"></a>How I'm using DuckDB</h2>\n<p>Embedding DuckDB in a Rust project allowed me to deliver something with a better end-user experience\nthat is easier to maintain,\nand saved writing hundreds of lines of code in the process.</p>\n<p>Most general-purpose languages like Python and Rust\ndon't have primitives for expressing things like joins across datasets.\nDuckDB, like most database systems, does!\nYes, I <em>could</em> write some code using the <code>parquet</code> crate\nthat would filter across a nested directory tree of 5,000 files.\nBut DuckDB does that out of the box!</p>\n<p>It feels like this is a &quot;regex moment&quot; for data processing.\nJust like you don't (usually) need to hand-roll string processing,\nthere's now little reason to hand-roll data aggregation.</p>\n<p>For the above visualization, I used the Rust DuckDB crate for the data 
processing,\nconverted the results to JSON,\nand served it up from an Axum web server.\nAll in a <em>single binary</em>!\nThat's a lot nicer than a bash script that executes SQL,\ndumps to a file, and then starts up a Python or Node web server!\nAnd breaks when you don't have Python or Node installed,\nyour OS changes its default shell,\nyou forget that some awk flag doesn't work on the GNU version,\nand so on.</p>\n<h1><a href=\"#apache-arrow\" aria-hidden=\"true\" class=\"anchor\" id=\"apache-arrow\"></a>Apache Arrow</h1>\n<p>The final thing I want to touch on is <a href=\"https://arrow.apache.org/\">Apache Arrow</a>.\nThis is yet another incredibly useful technology which I've been following for a while,\nbut never quite figured out how to properly use until last week.</p>\n<p>Arrow is a <em>language-independent memory format</em>\nthat's <em>optimized for efficient analytic operations</em> on modern CPUs and GPUs.\nThe core idea is that, rather than having to convert data from one format to another (this implies copying!),\nArrow defines a shared memory format which many systems understand.\nIn practice, this ends up being a bunch of standards which define common representations for different types,\nand libraries for working with them.\nFor example, the <a href=\"https://geoarrow.org/\">GeoArrow</a> spec\nbuilds on the Arrow ecosystem to enable operations on spatial data in a common memory format.\nPretty cool!</p>\n<h2><a href=\"#why-you-should-care-2\" aria-hidden=\"true\" class=\"anchor\" id=\"why-you-should-care-2\"></a>Why you should care</h2>\n<p>It turns out that copying and format shifting data can really eat into your processing times.\nArrow helps you sidestep that by reducing the amount of both you'll need to do,\nand by working on data in groups.</p>\n<h2><a href=\"#how-the-heck-to-use-it\" aria-hidden=\"true\" class=\"anchor\" id=\"how-the-heck-to-use-it\"></a>How the heck to use it?</h2>\n<p>Arrow is mostly hidden from view beneath other 
libraries.\nSo most of the time, especially if you're writing in a very high-level language like Python,\nyou won't even see it.</p>\n<p>But if you're writing something at a slightly lower level,\nit's something you may have to touch for critical sections.\nThe <a href=\"https://docs.rs/duckdb/latest/duckdb/\">DuckDB crate</a>\nincludes an <a href=\"https://docs.rs/duckdb/latest/duckdb/struct.Statement.html#method.query_arrow\">Arrow API</a>\nwhich will give you an iterator over <code>RecordBatch</code>es.\nThis is pretty convenient, since you can use DuckDB to gather all your data\nand just consume the stream of batches!</p>\n<p>So, how do we work with <code>RecordBatch</code>es?\nThe Arrow ecosystem, like Parquet, takes a lot of work to understand,\nand using the low-level libraries directly is difficult.\nEven as a seasoned Rustacean, I found the docs rather obtuse.</p>\n<p>After some searching, I finally found <a href=\"https://docs.rs/serde_arrow/\"><code>serde_arrow</code></a>.\nIt builds on the <code>serde</code> ecosystem with easy-to-use methods that operate on <code>RecordBatch</code>es.\nFinally, something I can use!</p>\n<p>I was initially worried about how performant the shift from columns to rows + any (minimal) <code>serde</code> overhead would be,\nbut this turned out to not be an issue.</p>\n<p>Here's how the code looks:</p>\n<pre><code class=\"language-rust\">serde_arrow::from_record_batch::&lt;Vec&lt;FoursquarePlaceRecord&gt;&gt;(&amp;batch)\n</code></pre>\n<p>A few combinators later and you've got a proper data pipeline!</p>\n<h1><a href=\"#review-what-this-enables\" aria-hidden=\"true\" class=\"anchor\" id=\"review-what-this-enables\"></a>Review: what this enables</h1>\n<p>What this ultimately enabled for me was being able to get a lot closer to &quot;scripting&quot;\na pipeline in Rust.\nMost people turn to Python or JavaScript for tasks like this,\nbut Rust has something to add: strong typing and all the related guarantees <em>which can only 
come with some level of formalism</em>.\nBut that doesn't necessarily have to get in the way of productivity!</p>\n<p>Hopefully this sparks some ideas for making your next data pipeline both fast and correct.</p>\n",
      "summary": "",
      "date_published": "2024-12-08T00:00:00-00:00",
      "image": "media/foursquare-os-places-density-2024.png",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "rust",
        "apache arrow",
        "parquet",
        "duckdb",
        "big data",
        "data engineering",
        "gis"
      ],
      "language": "en"
    },
    {
      "id": "https://ianwwagner.com//quadrupling-the-performance-of-a-data-pipeline.html",
      "url": "https://ianwwagner.com//quadrupling-the-performance-of-a-data-pipeline.html",
      "title": "Quadrupling the Performance of a Data Pipeline",
      "content_html": "<p>Over the past two weeks, I've been focused on optimizing some data pipelines.\nI inherited some old ones which seemed especially slow,\nand I finally hit a limit where an overhaul made sense.\nThe pipelines process and generate data on the order of hundreds of gigabytes,\nrequiring correlation and conflation across several datasets.</p>\n<p>The pipelines in question happened to be written in Node.js,\nwhich I will do my absolute best not to pick on too much throughout.\nNode is actually a perfectly fine solution for certain problems,\nbut was being used especially badly in this case.\nThe rewritten pipeline, using Rust, clocked in at 4x faster than the original.\nBut as we'll soon see, the choice of language wasn't even the main factor in the sluggishness.</p>\n<p>So, let's get into it...</p>\n<h1><a href=\"#problem-1-doing-cpu-bound-work-on-a-single-thread\" aria-hidden=\"true\" class=\"anchor\" id=\"problem-1-doing-cpu-bound-work-on-a-single-thread\"></a>Problem 1: Doing CPU-bound work on a single thread</h1>\n<p>Node.js made a splash in the early 2010s,\nand I can remember a few years where it was the hot new thing to write everything in.\nOne of the selling points was its ability to handle thousands (or tens of thousands)\nof connections with ease: all from JavaScript!\nThe key to this performance is <strong>async I/O</strong>.\nModern operating systems are insanely good at this, and Node made it <em>really</em> easy to tap into it.\nThis was novel to a lot of developers at the time, but it's pretty standard now\nfor building I/O-heavy apps.</p>\n<p><strong>Node performs well as long as you're dealing with I/O-bound workloads</strong>,\nbut the magic fades if your workload requires a lot of CPU work.\nBy default, Node is single-threaded.\nYou need to bring in <code>libuv</code>, worker threads (Node 10 or so), or something similar\nto access <em>parallel</em> processing from JavaScript.\nI've only seen a handful of Node programs use 
these,\nand the pipelines in question were not among them.</p>\n<h2><a href=\"#going-through-the-skeleton\" aria-hidden=\"true\" class=\"anchor\" id=\"going-through-the-skeleton\"></a>Going through the skeleton</h2>\n<p>If you ingest data files (CSV and the like) record-by-record in a naïve way,\nyou'll just read one record at a time, process, insert to the database, and so on in a loop.\nThe original pipeline code was fortunately not quite this bad (it did have batching at least),\nbut had some room for improvement.</p>\n<p>The ingestion phase, where you're just reading data from CSV, Parquet, etc.,\nmaps naturally to Rust's <a href=\"https://rust-lang.github.io/async-book/05_streams/01_chapter.html\">streams</a>\n(the cousin of futures).\nThe original Node code was actually fine at this stage,\nif a bit less elegant.\nBut the Rust structure we settled on is worth a closer look.</p>\n<pre><code class=\"language-rust\">fn csv_record_stream&lt;'a, S: AsyncRead + Unpin + Send + 'a, T: TryFrom&lt;StringRecord&gt;&gt;(\n    stream: S,\n    delimiter: u8,\n) -&gt; impl Stream&lt;Item = T&gt; + 'a\nwhere\n    &lt;T as TryFrom&lt;StringRecord&gt;&gt;::Error: Debug,\n{\n    let reader = AsyncReaderBuilder::new()\n        .delimiter(delimiter)\n        // Other config elided...\n        .create_reader(stream);\n    reader.into_records().filter_map(|res| async move {\n        // A match (rather than let-else) lets us log the error before discarding it.\n        let record = match res {\n            Ok(record) =&gt; record,\n            Err(e) =&gt; {\n                log::error!(&quot;Error reading from the record stream: {:?}&quot;, e);\n                return None;\n            }\n        };\n\n        match T::try_from(record) {\n            Ok(parsed) =&gt; Some(parsed),\n            Err(e) =&gt; {\n                log::error!(&quot;Error parsing record: {:?}.&quot;, e);\n                None\n            }\n        }\n    })\n}\n</code></pre>\n<p>It starts off dense, but the concept is simple.\nWe'll take an async reader,\nconfigure a CSV reader to pull records from it,\nand map them to another data type using <code>TryFrom</code>.\nIf 
there are any errors, we just drop them from the stream and log an error.\nThis usually isn't a reason to stop processing for our use case.</p>\n<p>You should <em>not</em> be putting expensive code in your <code>TryFrom</code> implementation.\nBut really quick things like verifying that you have the right number of fields,\nor that a field contains an integer or is non-blank are usually fair game.</p>\n<p>Rust's trait system really shines here.\nOur code can turn <em>any</em> CSV(-like) file\ninto an arbitrary record type.\nAnd the same techniques can apply to just about any other data format too.</p>\n<h2><a href=\"#how-to-use-tokio-for-cpu-bound-operations\" aria-hidden=\"true\" class=\"anchor\" id=\"how-to-use-tokio-for-cpu-bound-operations\"></a>How to use Tokio for CPU-bound operations?</h2>\n<p>Now that we've done the light format shifting and discarded some obviously invalid records,\nlet's turn to the heavier processing.</p>\n<pre><code class=\"language-rust\">let available_parallelism = std::thread::available_parallelism()?.get();\n// let record_pipeline = csv_record_stream(...);\nrecord_pipeline\n    .chunks(500)  // Batch the work (your optimal size may vary)\n    .for_each_concurrent(available_parallelism, |chunk| {\n        // Clone your database connection pool or whatnot before `move`\n        // Every app is different, but this is a pretty common pattern\n        // for sqlx, Elasticsearch, hyper, and more, which use Arcs and cheap clones for pools.\n        let db_pool = db_pool.clone();\n        async move {\n            // Process your records using a blocking threadpool\n            let documents = tokio::task::spawn_blocking(move || {\n                // Do the heavy work here!\n                chunk\n                    .into_iter()\n                    .map(do_heavy_work)\n                    .collect()\n            })\n            .await\n            .expect(&quot;Problem spawning a blocking task&quot;);\n\n            // Insert processed 
data to your database\n            db_pool.bulk_insert(documents).await.expect(&quot;You probably need an error handling strategy here...&quot;);\n        }\n    })\n    .await;\n</code></pre>\n<p>We used the <a href=\"https://docs.rs/futures/latest/futures/stream/trait.StreamExt.html#method.chunks\"><code>chunks</code></a>\nadaptor to pull hundreds of items at a time for more efficient processing in batches.\nThen, we used <a href=\"https://docs.rs/futures/latest/futures/stream/trait.StreamExt.html#method.for_each_concurrent\"><code>for_each_concurrent</code></a>\nin conjunction with <a href=\"https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.html\"><code>spawn_blocking</code></a>\nto introduce parallel processing.</p>\n<p>Note that neither <code>chunks</code> nor even <code>for_each_concurrent</code> implies any amount of <em>parallelism</em>\non its own.\n<code>spawn_blocking</code> is the only thing that can actually create a new thread of execution!\nChunking simply splits the work into batches (most workloads like this tend to benefit from batching).\nAnd <code>for_each_concurrent</code> allows for <em>concurrent</em> operations over multiple batches.\nBut <code>spawn_blocking</code> is what enables computation in a background thread.\nIf you don't use <code>spawn_blocking</code>,\nyou'll end up blocking Tokio's async workers,\nand your performance will tank.\nJust like the old Node.js code.</p>\n<p>The astute reader may point out that using <code>spawn_blocking</code> like this\nis not universally accepted as a solution.\nTokio is (relatively) optimized for non-blocking workloads, so some claim that you should avoid this pattern.\nBut my experience, having done this for 5+ years in production code serving over 2 billion requests/month,\nis that Tokio can be a great scheduler for heavier tasks too!</p>\n<p>One thing that's often overlooked in these discussions\nis that not all &quot;long-running operations&quot; are the same.\nOne category consists of 
graphics event loops,\nlong-running continuous computations,\nor other things that may not have an obvious &quot;end.&quot;\nBut some tasks <em>can</em> be expected to complete within some period of time\nthat's longer than a blink.</p>\n<p>In the case of the former (&quot;long-lived&quot; tasks), spawning a dedicated thread often makes sense.\nIn the latter scenario though, Tokio tasks with <code>spawn_blocking</code> can be a great choice.</p>\n<p>For our workload, we were doing a lot of the latter sort of operation.\nOne helpful rule of thumb I've seen is that if your task takes longer than tens of microseconds,\nyou should move it off the Tokio worker threads.\nUsing <code>chunks</code> and <code>spawn_blocking</code> avoids this death by a thousand cuts.\nIn our case, the parallelism resulted in a VERY clear speedup.</p>\n<h1><a href=\"#problem-2-premature-optimization-rather-than-backpressure\" aria-hidden=\"true\" class=\"anchor\" id=\"problem-2-premature-optimization-rather-than-backpressure\"></a>Problem 2: Premature optimization rather than backpressure</h1>\n<p>The original data pipeline was very careful to not overload the data store.\nPerhaps a bit too careful!\nThis may have been necessary at some point in the distant past,\nbut most data stores, from vanilla databases to multi-node clustered storage,\nhave some level of natural backpressure built in.\nThe Node implementation was essentially limiting the amount of work in-flight that hadn't been flushed.</p>\n<p>This premature optimization and the numerous micro-pauses it introduced\nwere another death by a thousand cuts problem.\nDropping the artificial limits approximately doubled throughput.\nIt turned out that our database was able to process 2-4x more records than under the previous implementation.</p>\n<p><strong>TL;DR</strong> — set a reasonable concurrency, let the server tell you when it's chugging (usually via slower response times),\nand let your async runtime handle the 
rest!</p>\n<h1><a href=\"#problem-3-serde-round-trips\" aria-hidden=\"true\" class=\"anchor\" id=\"problem-3-serde-round-trips\"></a>Problem 3: Serde round-trips</h1>\n<p>Serde, or serialization + deserialization, can be a silent killer.\nAnd unless you're tracking things carefully, you often won't notice!</p>\n<p>I recently listened to <a href=\"https://www.recodingamerica.us/\">Recoding America</a> at the recommendation of a friend.\nOne of the anecdotes made me want to laugh and cry at the same time.\nEngineers had designed a major improvement to GPS, but the rollout is delayed\ndue to a performance problem that renders it unusable.</p>\n<p>The project is overseen by Raytheon, a US government contractor.\nAnd they can't deliver because some arcane federal guidance (not even a regulation proper)\n&quot;recommends&quot; an &quot;Enterprise Service Bus&quot; in the architecture.\nThe startupper in me dies when I hear such things.\nThe &quot;recommendation&quot; boils down to a data exchange medium where one &quot;service&quot; writes data and another consumes it.\nThink message queues like you may have used before.</p>\n<p>This is fine (even necessary) for some applications,\nbut positively crippling for others.\nIn the case of the new positioning system,\nwhich was heavily dependent on timing,\nthis was a wildly inefficient architecture.\nEven worse, the guidelines stated that it should be encrypted.</p>\n<p>This wasn't even &quot;bad&quot; guidance, but in the context of the problem,\nwhich depended on rapid exchange of time-sensitive messages,\nit was a horrendously bad fit.</p>\n<p>In our data pipeline, I discovered a situation that, in retrospect, bears a humorous resemblance.\nThe pipeline was set up using a microservice architecture,\nwhich I'm sure sounded like a good idea at the time,\nbut it introduced some truly obscene overhead.\nAll services involved were capable of working with data in the same format,\nbut the Node.js implementation was split into multiple 
services with HTTP and JSON round trips in the middle!\nDouble whammy!</p>\n<p>The new data pipeline simply imports the &quot;service&quot; as a crate,\nand gets rid of all the overhead by keeping everything in-process.\nIf you do really need to have a microservice architecture (e.g., to scale another service up independently),\nthen other communication + data exchange formats may improve your performance.\nBut if it's possible to keep everything in-process, your overhead is roughly zero.\nThat's hard to beat!</p>\n<h1><a href=\"#conclusion\" aria-hidden=\"true\" class=\"anchor\" id=\"conclusion\"></a>Conclusion</h1>\n<p>In the end, the new pipeline was 4x the speed of the old.\nI happened to rewrite it in Rust, but Rust itself wasn't the source of all the speedups:\nunderstanding the architecture was.\nYou could achieve similar results in Node.js or Python,\nbut Rust makes it significantly easier to reason about the architecture and correctness of your code.\nThis is especially important when it comes to parallelizing sections of a pipeline,\nwhere Rust's type system will save you from the most common mistakes.</p>\n<p>These and other non-performance-related reasons to use Rust will be the subject of a future blog post (or two).</p>\n",
      "summary": "",
      "date_published": "2024-11-29T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "algorithms",
        "rust",
        "elasticsearch",
        "nodejs",
        "data engineering",
        "gis"
      ],
      "language": "en"
    },
    {
      "id": "https://ianwwagner.com//searching-for-tiger-features.html",
      "url": "https://ianwwagner.com//searching-for-tiger-features.html",
      "title": "Searching for TIGER Features",
      "content_html": "<p>Today I had a rather peculiar need to search through features from TIGER\nmatching specific attributes.\nThese files are not CSV or JSON, but rather ESRI Shapefiles.\nShapefiles are a binary format that has long outlived its welcome\naccording to many in the industry, but they still persist today.</p>\n<h1><a href=\"#context\" aria-hidden=\"true\" class=\"anchor\" id=\"context\"></a>Context</h1>\n<p>Yeah, so this post probably isn't interesting to very many people,\nbut here's a bit of context in case you don't know what's going on and you're still reading.\nTIGER is a geospatial dataset published by the US government.\nThere's far more to this dataset than fits in this TIL post,\nbut my interest in it lies in finding addresses.\nSpecifically, <em>guessing</em> at where an address might be.</p>\n<p>When you type an address into your maps app,\nit might not actually have the exact address in its database.\nThis happens more than you might imagine,\nbut you can usually get a pretty good guess of where the address is\nvia a process called interpolation.\nThe basic idea is that you take address data from multiple sources and use that to make a better guess.</p>\n<p>Some of the input to this is existing address points.\nBut there's one really interesting form of data that brings us to today's TIL:\naddress ranges.\nOne of the TIGER datasets is a set of lines (for the roads).\nEach segment is annotated with info letting us know the range of house numbers on each side of the road.</p>\n<p>I happen to use this data for my day job at Stadia Maps,\nwhere I was investigating a data issue today related to our geocoder and TIGER data.</p>\n<h1><a href=\"#getting-the-data\" aria-hidden=\"true\" class=\"anchor\" id=\"getting-the-data\"></a>Getting the data</h1>\n<p>In case you find yourself in a similar situation,\nyou may notice that the data from the government is sitting in an FTP directory,\nwhich contains a bunch of confusingly named ZIP 
files.\nThe data that I'm interested in (address features)\nhas names like <code>tl_2024_48485_addrfeat.zip</code>.</p>\n<p>The year might be familiar, but what's that other number?\nThat's a FIPS code for the county whose data is contained in the archive.\nYou can find a <a href=\"https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt\">list here</a>.\nThis is somewhat interesting in itself, since the first two digits are a state code (Texas, in this case).\nThe full number identifies a county: Wichita County.\nYou can suck down the entire dataset, just one file, or anything in-between\nfrom the <a href=\"https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html\">Census website</a>.</p>\n<h1><a href=\"#searching-for-features\" aria-hidden=\"true\" class=\"anchor\" id=\"searching-for-features\"></a>Searching for features</h1>\n<p>So, now you have a directory full of ZIP files,\neach of which has a bunch of files necessary to interpret the shapefile.\nIsn't GIS lovely?</p>\n<p>The following script will let you write a simple &quot;WHERE&quot; clause,\nfiltering the data exactly as it comes from the Census Bureau!</p>\n<pre><code class=\"language-bash\">#!/bin/bash\nset -e;\n\nfind &quot;$1&quot; -type f -iname &quot;*.zip&quot; -print0 |\\\n  while IFS= read -r -d $'\\0' filename; do\n\n    filtered_json=$(ogr2ogr -f GeoJSON -t_srs crs:84 -where &quot;$2&quot; /vsistdout/ &quot;/vsizip/$filename&quot;);\n    # Check if the filtered GeoJSON has any features\n    feature_count=$(echo &quot;$filtered_json&quot; | jq '.features | length')\n\n    if [ &quot;$feature_count&quot; -gt 0 ]; then\n      # echo filename to stderr\n      &gt;&amp;2 echo $(date -u) &quot;Match(es) found in $filename&quot;;\n      echo &quot;$filtered_json&quot;;\n    fi\n\n  done;\n</code></pre>\n<p>You can run it like so:</p>\n<pre><code class=\"language-shell\">./find-tiger-features.sh $HOME/Downloads/tiger-2021/ &quot;TFIDL = 213297979 OR TFIDR = 
213297979&quot;\n</code></pre>\n<p>This ends up being a LOT easier and faster than QGIS in my experience\nif you want to search for specific known attributes.\nEspecially if you don't know the specific area that you're looking for.\nI was surprised that no such tool for things like ID lookups existed already!</p>\n<p>Note that this isn't exactly &quot;fast&quot; by typical data processing workload standards.\nIt takes around 10 minutes to run on my laptop.\nBut it's a lot faster than the alternatives in many circumstances,\nespecially if you don't know exactly which file the data is in!</p>\n<p>For details on the fields available,\nrefer to the technical documentation on the <a href=\"https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html\">Census Bureau website</a>.</p>\n",
      "summary": "",
      "date_published": "2024-11-09T00:00:00-00:00",
      "image": "",
      "authors": [
        {
          "name": "Ian Wagner",
          "url": "https://fosstodon.org/@ianthetechie",
          "avatar": "media/avi.jpeg"
        }
      ],
      "tags": [
        "gis",
        "shell",
        "ogr2ogr"
      ],
      "language": "en"
    }
  ]
}