GTFS Schedule: We should check for file-level deletion in type 2 / SCD logic #1184

Closed
lauriemerrell opened this issue Mar 7, 2022 · 11 comments

@lauriemerrell
Contributor

In December, we changed URL 0 for SolTrans. The old feed had a feed_info.txt file, while the new feed does not. However, the pipeline did not mark the old feed_info as deleted when that change occurred.

This means that we still show that old feed_info file as active (for example it is still in feed_info_clean with calitp_deleted_at = 2099-01-01, i.e., "active").

As a result, information from that old feed still appears in the SolTrans reports even though the rest of the feed is correct and updated, because some fields in the reports are calculated from feed_info.

We need to figure out why feed_info wasn't marked as deleted in this case.
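
For reference, a query along these lines should surface the stale row (a minimal sketch, assuming the affected feed is ITP ID 310, URL number 0, per the note below):

-- Sketch: find feed_info rows still marked "active" for the affected feed.
SELECT *
FROM `cal-itp-data-infra.gtfs_schedule_type2.feed_info_clean`
WHERE calitp_itp_id = 310
    AND calitp_url_number = 0
    AND calitp_deleted_at = '2099-01-01'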

@lauriemerrell lauriemerrell added the project-gtfs-schedule For issues related to gtfs-schedule project label Mar 7, 2022
@lauriemerrell lauriemerrell self-assigned this Mar 7, 2022
@lauriemerrell
Contributor Author

AFAICT, our pipeline never checks whether an entire file has been deleted. gtfs_schedule_history2.merge_updates.py makes row-level comparisons, but only between processed files that are present on both sides of the comparison, so if a whole file was deleted it has no way to know. We do mark feeds as deleted (I'm not entirely sure where, but I confirmed that, for example, if you remove a feed from agencies.yml it gets a calitp_deleted_at date in gtfs_schedule_dim_feeds).

We need to make merge_updates handle deleted files.
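
As a sketch of what that check could look like (hypothetical -- table and column names follow the gtfs_schedule_history conventions used elsewhere in this thread), compare each feed's file names across consecutive extract dates:

-- Sketch only: flag files present on one extract date but absent the next.
WITH files_by_day AS (
    SELECT calitp_itp_id, calitp_url_number, name, calitp_extracted_at,
        DATE_ADD(calitp_extracted_at, INTERVAL 1 DAY) AS next_day
    FROM `cal-itp-data-infra.gtfs_schedule_history.calitp_files_updates`
)
SELECT prev.*
FROM files_by_day AS prev
LEFT JOIN files_by_day AS nxt
    ON prev.calitp_itp_id = nxt.calitp_itp_id
    AND prev.calitp_url_number = nxt.calitp_url_number
    AND prev.name = nxt.name
    AND prev.next_day = nxt.calitp_extracted_at
WHERE nxt.name IS NULL
-- Note: rows from the most recent extract date will also match here; the
-- investigation query in the comment below filters those using calitp_feed_status.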

@lauriemerrell
Contributor Author

Affected feed is 310.0

@lauriemerrell
Contributor Author

@holly-g, I'm removing the needs investigation label -- I know what's going on here.

We need to add a second check, either before or after merge_updates, that compares the files present in a feed day over day; if a file has been deleted, we need to populate its entries' calitp_deleted_at dates. (I.e., no further investigation is needed -- the cause here is known, and we have a mechanism to find these cases.)

Affected files are saved in cal-itp-data-infra-staging:laurie_test_general.gtfs_schedule_type2_deleted_files_investigation; they are the results of the following query, run on 2022-03-09:

WITH check_next_file AS (
    SELECT *,
        LEAD(md5_hash)
            OVER (PARTITION BY calitp_itp_id, calitp_url_number, name ORDER BY calitp_extracted_at)
            AS next_md5_hash,
        DATE_ADD(calitp_extracted_at, INTERVAL 1 DAY) as next_day
    FROM `cal-itp-data-infra.gtfs_schedule_history.calitp_files_updates`
),
check_next_feed AS (
    SELECT * EXCEPT(calitp_extracted_at),
        calitp_extracted_at AS next_day
    FROM `cal-itp-data-infra.gtfs_schedule_history.calitp_feed_status`
),
deleted_files AS (
    SELECT *
    FROM check_next_file
    LEFT JOIN check_next_feed USING(calitp_itp_id, calitp_url_number, next_day)
    WHERE next_md5_hash IS NULL and is_extract_success = true
    ORDER BY calitp_extracted_at DESC
),
today_files AS (
    SELECT name, calitp_itp_id, calitp_url_number, true as is_present_now
    FROM `cal-itp-data-infra.gtfs_schedule_history.calitp_files_updates`
    WHERE calitp_extracted_at = '2022-03-08'
)

SELECT * 
FROM deleted_files 
LEFT JOIN today_files using (calitp_itp_id, calitp_url_number, name)
ORDER BY calitp_extracted_at DESC 
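
If we go this route, the marking step could look roughly like this (a sketch only, not a tested statement -- it assumes the deleted_files results above were materialized to the staging table mentioned earlier, and uses feed_info_clean as the example target, since each file type has its own type 2 table):

-- Sketch only: close out rows for a file detected as deleted, using the day
-- after its last successful extract as the deletion date.
UPDATE `cal-itp-data-infra.gtfs_schedule_type2.feed_info_clean` t
SET calitp_deleted_at = d.next_day
FROM `cal-itp-data-infra-staging.laurie_test_general.gtfs_schedule_type2_deleted_files_investigation` d
WHERE t.calitp_itp_id = d.calitp_itp_id
    AND t.calitp_url_number = d.calitp_url_number
    AND d.name = 'feed_info.txt'
    AND t.calitp_deleted_at = '2099-01-01'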

@lauriemerrell
Contributor Author

I wonder if a version of the logic that we use to get the latest feed for GTFS schedule would work here. Example query:

SELECT *
FROM `cal-itp-data-infra.gtfs_schedule_type2.feed_info_clean`
LEFT JOIN `cal-itp-data-infra.gtfs_schedule_history.calitp_feed_status`
    USING(calitp_itp_id, calitp_url_number, calitp_extracted_at)
WHERE calitp_itp_id = 310 and calitp_url_number = 0
ORDER BY calitp_extracted_at DESC

Here, is_latest_load shows as false for the most recent data.
If there's a file that has calitp_deleted_at = 2099-01-01 but is_latest_load is false in this query, that is something that should be marked as deleted, I think.

There may be a preferable way to get this info from a later calitp status-type file; I would need to think about it more.

@lauriemerrell
Contributor Author

lauriemerrell commented Mar 29, 2022

Update: see #1280 -- the proposal immediately above about is_latest_load is not right. Row-level versioning isn't going to align with feed-level extract dates unless/until we change merge_updates to do versioning at the file level. 😕

@lauriemerrell
Contributor Author

lauriemerrell commented Mar 30, 2022

Another update: we need to look at is_interpolated here. I think a lot of these cases will have is_interpolated = True, and we need to investigate to what extent that should be preserved -- I don't know why we wouldn't be marking things as deleted in these cases, or why we would ever interpolate.

@lauriemerrell lauriemerrell changed the title GTFS Schedule: Investigate why SolTrans feed_info.txt did not get marked as deleted when URL changed GTFS Schedule: We should check for file-level deletion in type 2 / SCD logic Mar 30, 2022
@lauriemerrell lauriemerrell added the size: L one-week effort label Apr 1, 2022
@lauriemerrell lauriemerrell removed their assignment Apr 5, 2022
@holly-g holly-g added the bug Something isn't working label Apr 7, 2022
@lauriemerrell
Contributor Author

We should do this as part of #1259

@o-ram
Member

o-ram commented Jun 7, 2022

This issue was seen again in two separate instances, when the URLs for both Santa Maria Area Transit and Anaheim Regional Transit were updated. Both agencies switched vendors from Trillium to TripShot, resulting in the URL changes. The new TripShot feeds do not have a feed_info.txt file, and the pipeline appears to be pulling the old Trillium feed_info data instead of updating to reflect the new feeds. The issue is more pronounced for Santa Maria Area Transit, where the pipeline is also pulling the old Trillium feed's Fares v2 data (the TripShot feed does not include fares data). This may be because Anaheim's name changed slightly in our pipeline while Santa Maria's remained the same.

Example 1: Anaheim Regional Transit
[Screenshot of Metabase]
[Screenshot of May report]

Example 2: Santa Maria Area Transit
[Screenshots from Metabase]
[Screenshots from May report]

@lauriemerrell
Contributor Author

Thanks @o-ram for sharing these examples; it's very helpful to know that this is happening. cc @holly-g & @edasmalchi from a reports data display perspective -- wondering whether this needs to be fixed in the current data structure (i.e., within the next ~6 weeks) or whether it can wait until we refactor the versioning in the schedule pipeline (which could take a little while but will likely be a more robust fix).

Tagging this against #1536; I'm actually not sure that we're going to do #1259 under the current pipeline anymore.

@lauriemerrell
Contributor Author

@evansiroky and @e-lo -- I'm wondering if we should update the Transit Tech Stacks Airtable information to note the apparent tech stack changes that Olivia described above in #1184 (comment).

@atvaccaro
Contributor

atvaccaro commented Jan 26, 2023

Per conversation in sprint planning, going to close this, given that we now have an example of file deletion being handled properly. Since we version at the feed (aka extract) level, it's easy to see when a base64_url no longer has a file.

select *
from mart_gtfs.dim_schedule_feeds
where key in (
    '4a433f24ee4264e77a1306b4801fe238',
    '4c84830c9f86b096a26dde4df6c8a185'
)
order by _valid_from
;

select distinct original_filename
from mart_gtfs_quality.fct_schedule_feed_files
where feed_key = '4a433f24ee4264e77a1306b4801fe238'
except distinct
select distinct original_filename
from mart_gtfs_quality.fct_schedule_feed_files
where feed_key = '4c84830c9f86b096a26dde4df6c8a185'
;

select *
from mart_gtfs.dim_feed_info
where feed_key in (
    '4a433f24ee4264e77a1306b4801fe238',
    '4c84830c9f86b096a26dde4df6c8a185'
)
;

Also closing related Schedule issues as "not planned."

#521
#1344

@atvaccaro atvaccaro closed this as not planned Jan 26, 2023