Insights

1.7 million hotels, kept fresh — the inventory engine behind a global green marketplace

2026-05-02 · Michael English · Clonmel, Co. Tipperary

Most people who book a hotel online never think about what happens behind the search box. They type a city, pick a date, see a list, click. The list feels like a list. It isn't. It's the live output of a long, fragile pipeline that pulls from dozens of suppliers, reconciles their disagreements about which hotel is which, decides whose price to trust at this exact second, and stamps a carbon figure on the room before it shows up on the page. At IMPT.io we run that pipeline across 1.7 million hotels in 195 countries. This essay is about how we keep it honest.

The supplier graph is not a list of APIs

When founders describe a global hotel API stack, they usually draw a tidy diagram: a few wholesalers feeding a normaliser feeding a search index. The reality is messier. Each supplier has its own view of the world. One sees a property as a single hotel; another splits it into three room-type sub-properties; a third sells the same building under a chain code that changed names two years ago. None of them are wrong. They're just describing different slices of the same physical bed.

So we don't treat suppliers as a list. We treat them as a graph. Every supplier is a node. Every hotel they sell is an edge into a candidate cluster. Every cluster is a hypothesis about a real-world property. The graph holds disagreements rather than hiding them, and we resolve disagreements at query time rather than pretending we baked them out at ingest. That sounds expensive. It is. But the alternative — picking one supplier as the source of truth and discarding the rest — is how marketplaces end up with phantom inventory and broken booking links at 11pm on a Friday.
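The graph shape can be sketched minimally: listings are edges from a supplier node into a candidate cluster, and the cluster keeps every observation rather than collapsing them. All names and fields here are illustrative, not IMPT's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Listing:
    supplier: str           # the supplier node this edge comes from
    supplier_hotel_id: str  # the supplier's own key for the property
    name: str
    lat: float
    lon: float

@dataclass
class Cluster:
    """A hypothesis about one real-world property."""
    cluster_id: str
    listings: list[Listing] = field(default_factory=list)

    def add(self, listing: Listing) -> None:
        # Hold the disagreement: every supplier's view stays attached.
        self.listings.append(listing)

    def names(self) -> set[str]:
        # Resolved at query time: every name suppliers use for the property.
        return {l.name for l in self.listings}
```

Keeping every supplier observation on the cluster is what lets a query-time resolver pick the freshest or most trusted view, instead of committing to one supplier at ingest and discovering the mistake on a Friday night.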

De-duplication is the hard part nobody talks about

If you've never tried to de-duplicate hotels at scale, here is the short version: the data is dirty in every dimension you care about. Names are translated, transliterated, abbreviated, or branded under a parent group that the supplier hasn't updated. Addresses are in local script, or they use the postal town instead of the city, or the building has two street numbers because it sits on a corner. Geocodes drift by 50 metres because someone clicked the wrong pin. Phone numbers are reception, sales office, or a call centre in a different country.

The naive approach is fuzzy string matching on names plus a distance threshold on coordinates. It gets you about 80% of the way and then quietly destroys the rest. Two hotels can sit 30 metres apart and be entirely different businesses. A small guesthouse in Kinsale and the building next door are not the same hotel just because their pins overlap.

What works for us is layered. We start with strong signals — chain codes, GIATA IDs where suppliers expose them, exact-match phone numbers, exact-match domain names on the hotel's own site. These give us a high-confidence spine. Then we use weaker signals to attach candidates to that spine: name similarity weighted by the rarity of the tokens, address tokens after normalisation, geocode within a tight radius, photo perceptual hashes when we have them. Each candidate match carries a score. Above a threshold it joins the cluster. Below, it sits in a review queue. In between, it goes to the agent — more on that below.
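The routing logic can be sketched as below. The weights, thresholds, and signal set are invented placeholders; the real matcher uses richer signals such as token-rarity weighting and photo perceptual hashes.

```python
import math

AUTO_MERGE = 0.90   # above this, the candidate joins the cluster
AUTO_REJECT = 0.40  # below this, it sits in the review queue

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in metres between two geocodes.
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))

def strong_signal_match(a: dict, b: dict) -> bool:
    # High-confidence spine: exact IDs, phone numbers, or domains.
    for key in ("giata_id", "phone", "domain"):
        if a.get(key) and a.get(key) == b.get(key):
            return True
    return False

def weak_signal_score(a: dict, b: dict) -> float:
    # Plain Jaccard token overlap stands in for rarity-weighted similarity.
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    name = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    # Geocode proximity: full credit near zero distance, decaying past ~50 m.
    geo = math.exp(-haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) / 50.0)
    return 0.6 * name + 0.4 * geo

def route(a: dict, b: dict) -> str:
    if strong_signal_match(a, b):
        return "merge"
    score = weak_signal_score(a, b)
    if score >= AUTO_MERGE:
        return "merge"
    if score < AUTO_REJECT:
        return "review"
    return "agent"  # the ambiguous middle band goes to the agent
```

The geocode decay is the point of the Kinsale example: overlapping pins alone never push a candidate past the auto-merge line, because the name term caps the score.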

Price and availability is a cache problem, not a database problem

People assume hotel prices live in a database. They don't. They live in the suppliers' systems and they change without warning. A room that was €180 at lunchtime can be €240 by the time someone finishes their coffee and clicks through. Availability is even worse — a property can sell out for a date in the time it takes a search page to render.

So the question isn't "where do we store prices". The question is "how stale are we willing to be, and where". We run a tiered cache. The hot tier holds quotes for the most-searched city/date combinations and refreshes aggressively. The warm tier holds quotes for properties that have been seen in the last few days. The cold tier holds nothing — we go to the supplier on demand and accept the latency. The trick is deciding which tier a query belongs to before you've answered it.

Two practical rules we've ended up with:

  • Never show a price you can't honour. If the cache is stale beyond a tolerable window for that property, hide the price and re-quote on click. A missing price is a poor experience; a price we can't honour is worse.
  • Treat availability and price as separate problems. A property can be available at a price you don't yet know. Caching them as a single object means a stale price invalidates valid availability. Splitting them halved our quote-on-click failures.
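Both rules can be sketched together. The tolerances are made up; the point is the two independent freshness windows, so a stale price degrades to re-quote-on-click without hiding valid availability.

```python
import time

PRICE_TTL = 15 * 60  # a stale price must never be shown (illustrative)
AVAIL_TTL = 60 * 60  # availability tolerates more staleness (illustrative)

def render_quote(price_entry, avail_entry, now=None):
    now = time.time() if now is None else now
    avail_fresh = avail_entry and now - avail_entry["fetched"] <= AVAIL_TTL
    price_fresh = price_entry and now - price_entry["fetched"] <= PRICE_TTL
    if not avail_fresh:
        return {"show": False}  # can't claim the room exists at all
    if not price_fresh:
        # Rule 1: never show a price you can't honour.
        return {"show": True, "price": None, "requote_on_click": True}
    return {"show": True, "price": price_entry["amount"], "requote_on_click": False}
```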

Carbon enrichment: the layer that makes us a sustainable hotel directory rather than just a hotel directory

Every booking on IMPT offsets one tonne of CO₂ on-chain, paid out of our commission rather than added to the guest's bill. That commitment only works if the carbon side of the data is as well-kept as the price side. So we enrich every property in the graph with a footprint estimate — based on country grid intensity, property type, star rating where reliable, and any verified sustainability data the property itself publishes. The estimate isn't the offset. The offset is a flat tonne, retired on-chain, regardless of how the maths shakes out for any individual stay. The estimate is for transparency: it lets a guest see what their night actually cost the atmosphere, and it lets us flag properties that are genuinely doing the work versus ones with a recycling bin and a press release.

We refresh the carbon layer on a slower cadence than price — a footprint estimate doesn't change between Tuesday and Wednesday — but it has to refresh. Grid intensities shift as countries add renewables. Properties retrofit. New verified data arrives. Treating carbon as a one-time enrichment is how greenwashing gets baked into a marketplace. Treating it as a managed cache, like price, is how you keep the directory honest.

The AI agent that replaced three humans

Until recently, three things in the pipeline needed people. Reviewing borderline de-duplication candidates. Reconciling supplier conflicts where two feeds disagreed about whether a property was open, closed, or renamed. And triaging carbon-data anomalies where a property's published footprint disagreed sharply with our model.

All three were the same shape of work: read several pieces of conflicting evidence, weigh them against priors, write a decision with a justification, and move on. That shape of work is what large models do well now, provided you constrain the inputs and demand a structured justification on the way out.

So we built an agent. It runs the review queue overnight. For each item, it pulls the candidate evidence — supplier records, the hotel's own website where reachable, recent reviews for sanity-check signals, photos for perceptual comparison — and produces a decision plus a confidence and a written reason. High-confidence decisions apply automatically. Low-confidence ones still go to a human, but the human now reads a structured brief instead of starting cold.
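The structured output and the dispatch rule can be sketched as below; the schema fields and the auto-apply threshold are illustrative.

```python
from dataclasses import dataclass

AUTO_APPLY = 0.9  # illustrative threshold, calibrated offline in practice

@dataclass
class AgentDecision:
    item_id: str
    decision: str      # e.g. "merge", "split", "mark_closed", "escalate"
    confidence: float  # model-reported confidence in [0, 1]
    reason: str        # written justification, kept for audit

def dispatch(d: AgentDecision) -> str:
    if d.confidence >= AUTO_APPLY:
        return "apply"        # applied automatically overnight
    return "human_queue"      # human reads the brief instead of starting cold
```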

Two honest observations about this. First, the agent is not better than a careful human at any single decision. It is better at being awake at 3am for the eighty-thousandth one. Second, the value isn't replacing headcount — it's collapsing the latency between a supplier change and the marketplace reflecting it. Hotel data freshness is the whole game in travel marketplace tech, and freshness is a function of how fast you can resolve the ambiguities the world keeps generating.

What the inventory engine looks like end to end

Pulled together, the hotel inventory engine has five layers:

  1. Ingest. Supplier feeds, normalised into a common shape, with provenance kept on every field.
  2. Graph. Properties as clusters of supplier observations, with disagreements held rather than discarded.
  3. Cache. Tiered price and availability, refreshed by demand signal rather than schedule.
  4. Enrichment. Carbon footprint, sustainability signals, photos, descriptions — refreshed on their own cadence.
  5. Agent. The decision layer that handles ambiguity at the speed the rest of the system needs.

Each layer fails in its own way. The ingest layer fails when a supplier silently changes a field. The graph fails when a chain rebrands. The cache fails when traffic patterns shift. The enrichment fails when a country's grid changes faster than we re-run the model. The agent fails when the evidence is genuinely insufficient and it should have escalated. We've learnt to instrument every layer separately, because a single end-to-end metric — "is search working" — hides which layer broke until it's too late.

What we're doing about it this week

The thing we're shipping this week is more boring and more useful than the agent: an internal freshness dashboard, showing for each country how stale our price cache is at any given hour, and how many cluster decisions are sitting in the queue. If you're building anything with a global hotel API or running your own sustainable hotel directory, the action item is the same — pick the one freshness metric you would be most embarrassed to have a customer discover before you did, and put it on a wall. Then go fix whichever layer it points at. We've been doing this for years and the wall still surprises us most weeks. That's the job.
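One hedged sketch of a wall metric of that kind, assuming cache entries carry a country code and a fetch timestamp (field names are made up): the worst price-cache age per country, in minutes.

```python
def worst_cache_age_minutes(entries: list[dict], now: float) -> dict[str, float]:
    """Maximum cached-price age per country, in minutes."""
    worst: dict[str, float] = {}
    for e in entries:
        age = (now - e["fetched"]) / 60
        worst[e["country"]] = max(age, worst.get(e["country"], 0.0))
    return {c: round(a, 1) for c, a in worst.items()}
```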

Reading these regularly?

Subscribe to Letters from Clonmel — quarterly long-form founder letters from Mike. First letter Q2 2026.

Subscribe →